Bloom Filters: Designing a Spellchecker

Posted by: Sameer Agarwal | March 2, 2008

Bloom Filters: Designing a Spellchecker

Bloom filters are highly space/time efficient probabilistic data structures that are used to solve the membership problem of a set, that is, given an element, it is used to find out whether the particular element belongs to the set or not. False positives can occur in the results (though we can really minimize its probability) but false negatives are not allowed.

Ever wondered how the browser/text editor spell checkers work? They tend to take some bare minimum space and are very time efficient. You guessed it right : Bloom Filters is the key! I have designed a Spellchecker in C language using bloom filters and would like to explain how it works, thereby explaining bloom filters in general.

The image “http://www.cs.ucla.edu/~rlaufer/gbf/gbf.gif” cannot be displayed, because it contains errors.

Now, the data structure of the spell checker is basically a m-bit array. Moreover we have k hash functions which act on strings and return a number between 1 to m-1. For inserting each word in the dictionary, we calculate its hash and set the corresponding bit it generated to be 1. So for each word, it is acted upon by k hash functions and correspondingly k bits are set to 1. Similarly, other words are also added in a similar fashion. Now, for testing membership, we calculate the k hashes of the word to be tested and if either of the bit corresponding to the generated position is found to be zero, we reject it. Clearly, false positives are possible (the bit being set to 1 by some other string) but false negatives can never be possible.

And now, the coolest part. Let us mathematically play around with the probability of false positives :

Assume that a hash function selects each array position with equal probability. The probability that a certain bit is not set to one by a certain hash function during the insertion of an element is then

$1-frac{1}{m}$ .

The probability that it is not set by any of the hash functions is

$left(1-frac{1}{m}right)^k$ .

If we have inserted n elements, the probability that a certain bit is still 0 is

$left(1-frac{1}{m}right)^{kn}$ ;

the probability that it is 1 is therefore

$1-left(1-frac{1}{m}right)^{kn}$ .

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is then

$left(1-left(1-frac{1}{m}right)^{kn}right)^k approx left(1-e^{-kn/m}right)^k$ .

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

$frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n}$ ,

which gives the false positive probability of

$left(frac{1}{2}right)^{k} approx {0.6185}^{m/n}$ .

OK, there was quite a bit of catch in assuming that the “hash function selects each array position with equal probability”. Theoretically this is not quite possible. Randomness is not to random in the world of computer science, but it is nevertheless a good approximation (or at least I would LIKE to believe that)

Implementation

Source code of the spell checker can be obtained here

The spell checker presently has 3 options :

1. Add a new word : This basically requires making the word go through k hash functions (k = 1 1 in my implementation) and the setting of corresponding bits in the bit array. The standard hash functions I used are:

RSHash
JSHash
PJWHash
ELFHash
BKDRHash
SDBMHash
DJBHash
DEKHash
BPHash
FNVHash
APHash

These functions are a beauty in themselves, I will will also blog about them soon
2. Query a word : Requires the word to be operated on by k hash functions are returned true if all the corresponding bits are set.

3. Recreate optimized bit array : Now, as shown above, the value of k that minimizes the probability of false positives

$frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n}$

So, here we have k fixed and n depends on the number of words in the dictionary. So, we can only edit m, i.e. the number of bits in the bit array. Therefore, whenever we needs to recreate the optimized bit array, we simply re-calculate m based on this equation and re-hash the words.

Operation 1 and 2 are constant time operations while operation 3 is quite expensive and must be used once in a while. Moreover, just an interesting fact is that a Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements! This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

References :

[1] http://en.wikipedia.org/wiki/Bloom_filter

[2] http://michael.dipperstein.com/bitlibs/

[3] http://www.partow.net/programming/hashfunctions/index.html

Posted in bloom filters, computer science, hashing

Responses

Whoaaa!
By: Strange Jones on June 9, 2008
at 5:17 pm

Reply
many spell checkers use some notion of edit-distance to return a list of possible likely candidate words for each misspelled word. how would you do that using a bloom filter?
By: john smith on June 9, 2008
at 5:59 pm

Reply
This is pretty cool. Nice post.
By: wcuk on June 9, 2008
at 6:10 pm

Reply
I like to poo poo i like to pee pee, and now you see the kaa kaa and the tee tee
By: Poopoo on June 9, 2008
at 7:35 pm

Reply
You could use this for dictionary lookup, but it would make for a truly half-assed spellchecker.
By: Nick on June 9, 2008
at 7:43 pm

Reply
Nice first post! I look forward to reading more articles by you in the future.
By: Aron Ahmadia on June 9, 2008
at 8:21 pm

Reply
@john
Such an implementation of closest possible matches can be done by implementing a different set of special hashes which have closer numerical values if words are related. eg. pickle and tickle may have hash values 599 and 601. I will write an elaborated post on this soon 🙂

Moreover, we already have closest match algorithms such as Ispell (by Geoff Kuenning) and GNU Aspell (by Kevin Atkinson). These are probably the best spell checkers that exist today. the advantage of Ispell is primarily its huge language support (one of the important motives of any project) whilst Aspell’s primary advantage is that it is better at making suggestions when a word is seriously misspelled. For example, when given “trubble”, ispell will suggest only “rubble”, where aspell suggests “trouble” (as its first choice) as well as “dribble”, “rubble”, and a lot of other words. Its disadvantage is that the approximate-matching algorithm is specific to English.
By: Sameer Agarwal on June 9, 2008
at 8:32 pm

Reply
@wkuc, Aron
Thanks guys!
By: Sameer Agarwal on June 9, 2008
at 8:33 pm

Reply
This is nice, you have explained the bloom filters very well, I remember resorting to a distributed mechanism to decide whether a word is member or not… while coding a game during my 2ndbtech… that was long back… Although my challenge was 2 fold, 1 to find the membership and 2 to find whether it was a word at all (no access to any source of words you see)…

very good post once again.. added my reader 🙂
By: Chandoo on June 9, 2008
at 9:27 pm

Reply
Wow now that was fascinating indeed. Totally blew my mind.

JT
http://www.FireMe.To/udi
By: Jimmy TwoTone on June 9, 2008
at 10:41 pm

Reply
Bloom filters are a poor choice for spell checkers. You would do much better with a GADDAG, DAGGAD or DAWG.
By: StoneCypher on June 9, 2008
at 11:00 pm

Reply
When I saw this in my first look into data structures we used an inferior approach to tackling the issue of spell checking.
By: alphaJ on June 10, 2008
at 12:26 am

Reply
[…] Bloom Filters: Designing a Spellchecker « I POWER INFINITY (tags: algorithms programming algorithm filter bloom-filter spellcheck bloom datastructures reference) […]
By: links for 2008-06-10 « Donghai Ma on June 10, 2008
at 4:30 am

Reply
You don’t need a big list of hash functions. You really only need one high quality function. To get 10 different hash values with the same key and hash function, just hash it 10 times but each time you add a different prefix. If the hash function is of good quality, then it will be just like having 10 independent hash functions.
By: Ken Bateman on June 10, 2008
at 6:20 am

Reply
Ok, that was a good approach to solving ‘a problem’ by doing hash table key look-ups, but it’s more of a beginning of a spell checker.

nice read
By: Arek on June 10, 2008
at 6:34 am

Reply
Hash Functions: Not only complete randomness but also that the used hash functions are independent. E.g. covariance is diag(1,1,1,…1) between the hashes generated from the same key. THIS is difficult to prove.
By: DavidS on June 10, 2008
at 7:04 am

Reply
u r on reddit!!!
By: Ankush on June 10, 2008
at 7:19 am

Reply
Using 11 different hash functions is overkill. In fact, in some applications, hashing is the bottleneck. See here: http://www.cc.gatech.edu/fac/Pete.Manolios/pub/bloom-filters-verification.pdf
By: michaelw on June 10, 2008
at 1:15 pm

Reply
Pretty Cool Data Structure.. Great Post
By: Akhil Ravidas on June 10, 2008
at 2:57 pm

Reply
I’m an intelligent guy, but every time I view something in the programing sub-reddit I feel dumber than a chilled turd. Upmod me if you didn’t understand a fucking thing on that page.
By: web design company on June 10, 2008
at 4:17 pm

Reply
[…] Bloom Filters: Designing a Spellchecker « I POWER INFINITY (tags: algorithms algorithm computer data development dictionary programming reference research science search text tutorial) […]
By: links for 2008-06-11 on June 11, 2008
at 12:37 am

Reply
+1 on the hash thing. Apart from that your article is very nice, and will probably be helpful for a number of people.
Maybe you could erase this part of your article, because it might be misleading for many people?
By: poulejapon on June 11, 2008
at 4:37 am

Reply
so u’re already off the mark!
great goin bro!
By: goverdhan on June 18, 2008
at 1:41 pm

Reply
[…] Bloom Filters: Designing a Spellchecker « I POWER INFINITY Posted on July 1st, 2008 by george […]
By: Saving the Band » Bloom Filters: Designing a Spellchecker « I POWER INFINITY on July 1, 2008
at 7:48 pm

Reply
[…] element, it is used to find out whether the particular element belongs to the set or not. False poshttps://ipowerinfinity.wordpress.com/2008/03/02/bloom-filters-designing-a-spellchecker/Amazon.com: &quotStructured Problem Solving&quot: Key Phrase pageKey Phrase page for structured […]
By: structured problem solving on July 26, 2008
at 7:51 am

Reply
[…] store one bit that determines if the word is correct, as described in Programming Pearls book and Bloom filters blog post. In this case, the size of hash table can be even less than the size of ternary DAG, however, an […]
By: ternary_dags [smallcode] on January 3, 2009
at 3:24 am

Reply
[…] Instead of keeping a word in the hash table, you can store one bit that determines if the word is correct, as described in Programming Pearls book and Bloom Filters blog post. […]
By: Using ternary DAGs for spelling correction on January 10, 2009
at 1:34 am

Reply

	Sameer Agarwal on Fun with GTalk/Pidgin: Changin…
	emdub on Fun with GTalk/Pidgin: Changin…
	Ravi on Fun with GTalk/Pidgin: Changin…
	Sameer Agarwal on Should I ask her out? It…
	Vikram on Should I ask her out? It…

I POWER INFINITY