Bloom filters are highly space/time efficient probabilistic data structures that are used to solve the membership problem of a set, that is, given an element, it is used to find out whether the particular element belongs to the set or not. False positives can occur in the results (though we can really minimize its probability) but false negatives are not allowed.
Ever wondered how the browser/text editor spell checkers work? They tend to take some bare minimum space and are very time efficient. You guessed it right : Bloom Filters is the key! I have designed a Spellchecker in C language using bloom filters and would like to explain how it works, thereby explaining bloom filters in general.
Now, the data structure of the spell checker is basically a m-bit array. Moreover we have k hash functions which act on strings and return a number between 1 to m-1. For inserting each word in the dictionary, we calculate its hash and set the corresponding bit it generated to be 1. So for each word, it is acted upon by k hash functions and correspondingly k bits are set to 1. Similarly, other words are also added in a similar fashion. Now, for testing membership, we calculate the k hashes of the word to be tested and if either of the bit corresponding to the generated position is found to be zero, we reject it. Clearly, false positives are possible (the bit being set to 1 by some other string) but false negatives can never be possible.
And now, the coolest part. Let us mathematically play around with the probability of false positives :
Assume that a hash function selects each array position with equal probability. The probability that a certain bit is not set to one by a certain hash function during the insertion of an element is then
The probability that it is not set by any of the hash functions is
If we have inserted n elements, the probability that a certain bit is still 0 is
the probability that it is 1 is therefore
Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is then
Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is
which gives the false positive probability of
OK, there was quite a bit of catch in assuming that the “hash function selects each array position with equal probability”. Theoretically this is not quite possible. Randomness is not to random in the world of computer science, but it is nevertheless a good approximation (or at least I would LIKE to believe that)
Source code of the spell checker can be obtained here
The spell checker presently has 3 options :
1. Add a new word : This basically requires making the word go through k hash functions (k = 1 1 in my implementation) and the setting of corresponding bits in the bit array. The standard hash functions I used are:
These functions are a beauty in themselves, I will will also blog about them soon
2. Query a word : Requires the word to be operated on by k hash functions are returned true if all the corresponding bits are set.
3. Recreate optimized bit array : Now, as shown above, the value of k that minimizes the probability of false positives
So, here we have k fixed and n depends on the number of words in the dictionary. So, we can only edit m, i.e. the number of bits in the bit array. Therefore, whenever we needs to recreate the optimized bit array, we simply re-calculate m based on this equation and re-hash the words.
Operation 1 and 2 are constant time operations while operation 3 is quite expensive and must be used once in a while. Moreover, just an interesting fact is that a Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements! This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.