Posted by: Sameer Agarwal | March 2, 2008

Bloom Filters: Designing a Spellchecker

Bloom filters are highly space/time efficient probabilistic data structures that are used to solve the membership problem of a set, that is, given an element, it is used to find out whether the particular element belongs to the set or not. False positives can occur in the results (though we can really minimize its probability) but false negatives are not allowed.

Ever wondered how the browser/text editor spell checkers work? They tend to take some bare minimum space and are very time efficient. You guessed it right : Bloom Filters is the key! I have designed a Spellchecker in C language using bloom filters and would like to explain how it works, thereby explaining bloom filters in general.

The image “http://www.cs.ucla.edu/~rlaufer/gbf/gbf.gif” cannot be displayed, because it contains errors.

Now, the data structure of the spell checker is basically a m-bit array. Moreover we have k hash functions which act on strings and return a number between 1 to m-1. For inserting each word in the dictionary, we calculate its hash and set the corresponding bit it generated to be 1. So for each word, it is acted upon by k hash functions and correspondingly k bits are set to 1. Similarly, other words are also added in a similar fashion. Now, for testing membership, we calculate the k hashes of the word to be tested and if either of the bit corresponding to the generated position is found to be zero, we reject it. Clearly, false positives are possible (the bit being set to 1 by some other string) but false negatives can never be possible.

And now, the coolest part. Let us mathematically play around with the probability of false positives :

Assume that a hash function selects each array position with equal probability. The probability that a certain bit is not set to one by a certain hash function during the insertion of an element is then

1-frac{1}{m}.

The probability that it is not set by any of the hash functions is

left(1-frac{1}{m}right)^k.

If we have inserted n elements, the probability that a certain bit is still 0 is

left(1-frac{1}{m}right)^{kn};

the probability that it is 1 is therefore

1-left(1-frac{1}{m}right)^{kn}.

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is then

left(1-left(1-frac{1}{m}right)^{kn}right)^k approx left(1-e^{-kn/m}right)^k.

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n},

which gives the false positive probability of

left(frac{1}{2}right)^{k} approx {0.6185}^{m/n}.

OK, there was quite a bit of catch in assuming that the “hash function selects each array position with equal probability”. Theoretically this is not quite possible. Randomness is not to random in the world of computer science, but it is nevertheless a good approximation (or at least I would LIKE to believe that)

Implementation

Source code of the spell checker can be obtained here

The spell checker presently has 3 options :

1. Add a new word : This basically requires making the word go through k hash functions (k = 1 1 in my implementation) and the setting of corresponding bits in the bit array. The standard hash functions I used are:

  • RSHash
  • JSHash
  • PJWHash
  • ELFHash
  • BKDRHash
  • SDBMHash
  • DJBHash
  • DEKHash
  • BPHash
  • FNVHash
  • APHash

These functions are a beauty in themselves, I will will also blog about them soon
2. Query a word : Requires the word to be operated on by k hash functions are returned true if all the corresponding bits are set.

3. Recreate optimized bit array : Now, as shown above, the value of k that minimizes the probability of false positives

frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n}

So, here we have k fixed and n depends on the number of words in the dictionary. So, we can only edit m, i.e. the number of bits in the bit array. Therefore, whenever we needs to recreate the optimized bit array, we simply re-calculate m based on this equation and re-hash the words.

Operation 1 and 2 are constant time operations while operation 3 is quite expensive and must be used once in a while. Moreover, just an interesting fact is that a Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements! This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

References :

[1] http://en.wikipedia.org/wiki/Bloom_filter

[2] http://michael.dipperstein.com/bitlibs/

[3] http://www.partow.net/programming/hashfunctions/index.html

About these ads

Responses

  1. Whoaaa!

  2. many spell checkers use some notion of edit-distance to return a list of possible likely candidate words for each misspelled word. how would you do that using a bloom filter?

  3. This is pretty cool. Nice post.

  4. I like to poo poo i like to pee pee, and now you see the kaa kaa and the tee tee

  5. You could use this for dictionary lookup, but it would make for a truly half-assed spellchecker.

  6. Nice first post! I look forward to reading more articles by you in the future.

  7. @john
    Such an implementation of closest possible matches can be done by implementing a different set of special hashes which have closer numerical values if words are related. eg. pickle and tickle may have hash values 599 and 601. I will write an elaborated post on this soon :)

    Moreover, we already have closest match algorithms such as Ispell (by Geoff Kuenning) and GNU Aspell (by Kevin Atkinson). These are probably the best spell checkers that exist today. the advantage of Ispell is primarily its huge language support (one of the important motives of any project) whilst Aspell’s primary advantage is that it is better at making suggestions when a word is seriously misspelled. For example, when given “trubble”, ispell will suggest only “rubble”, where aspell suggests “trouble” (as its first choice) as well as “dribble”, “rubble”, and a lot of other words. Its disadvantage is that the approximate-matching algorithm is specific to English.

  8. @wkuc, Aron
    Thanks guys!

  9. This is nice, you have explained the bloom filters very well, I remember resorting to a distributed mechanism to decide whether a word is member or not… while coding a game during my 2ndbtech… that was long back… Although my challenge was 2 fold, 1 to find the membership and 2 to find whether it was a word at all (no access to any source of words you see)…

    very good post once again.. added my reader :)

  10. Wow now that was fascinating indeed. Totally blew my mind.

    JT
    http://www.FireMe.To/udi

  11. Bloom filters are a poor choice for spell checkers. You would do much better with a GADDAG, DAGGAD or DAWG.

  12. When I saw this in my first look into data structures we used an inferior approach to tackling the issue of spell checking.

  13. […] Bloom Filters: Designing a Spellchecker « I POWER INFINITY (tags: algorithms programming algorithm filter bloom-filter spellcheck bloom datastructures reference) […]

  14. You don’t need a big list of hash functions. You really only need one high quality function. To get 10 different hash values with the same key and hash function, just hash it 10 times but each time you add a different prefix. If the hash function is of good quality, then it will be just like having 10 independent hash functions.

  15. Ok, that was a good approach to solving ‘a problem’ by doing hash table key look-ups, but it’s more of a beginning of a spell checker.

    nice read

  16. Hash Functions: Not only complete randomness but also that the used hash functions are independent. E.g. covariance is diag(1,1,1,…1) between the hashes generated from the same key. THIS is difficult to prove.

  17. u r on reddit!!!

  18. Using 11 different hash functions is overkill. In fact, in some applications, hashing is the bottleneck. See here: http://www.cc.gatech.edu/fac/Pete.Manolios/pub/bloom-filters-verification.pdf

  19. Pretty Cool Data Structure.. Great Post

  20. I’m an intelligent guy, but every time I view something in the programing sub-reddit I feel dumber than a chilled turd. Upmod me if you didn’t understand a fucking thing on that page.

  21. […] Bloom Filters: Designing a Spellchecker « I POWER INFINITY (tags: algorithms algorithm computer data development dictionary programming reference research science search text tutorial) […]

  22. +1 on the hash thing. Apart from that your article is very nice, and will probably be helpful for a number of people.
    Maybe you could erase this part of your article, because it might be misleading for many people?

  23. so u’re already off the mark!
    great goin bro!

  24. […] Bloom Filters: Designing a Spellchecker « I POWER INFINITY Posted on July 1st, 2008 by george […]

  25. […] element, it is used to find out whether the particular element belongs to the set or not. False poshttp://ipowerinfinity.wordpress.com/2008/03/02/bloom-filters-designing-a-spellchecker/Amazon.com: &quotStructured Problem Solving&quot: Key Phrase pageKey Phrase page for structured […]

  26. […] store one bit that determines if the word is correct, as described in Programming Pearls book and Bloom filters blog post. In this case, the size of hash table can be even less than the size of ternary DAG, however, an […]

  27. […] Instead of keeping a word in the hash table, you can store one bit that determines if the word is correct, as described in Programming Pearls book and Bloom Filters blog post. […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Categories

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: