A Bloom Filter is a 'set' that doesn't store any elements, and trades correctness for space. It is a probabalistic data structure- it tells whether something is probably within the set, or whether it definitely isn't.
It is typically implemented using a bit array- essentially a list of True or False values. The size of that list determines how resistant the bloom filter will be to false positives.
A Bloom filter will be (informally) defined by a set of hash
algorithms (not necessarily cryptographic) H and the
size of the bit vector N (in the above example it is 10).
It needs to implement two operations,
add an element, we set the result of
hashing the element using each hash algorithm from H
in the bit array to 1. For example if one algorithm returns 1
and the other returns 3, the bit array would be:
Implementation detail: one can always limit the output
of the hash algorithm to within 0 and N-1 (inclusive)
by 'mod'-ing the result to base N (the
operator in most languages). For instance the FNV family of
hash algorithms returns a 32-bit integer. We can just take
that integer mod 10 for our bit example.
test an element, we check if all
the results of hashing the element using each algorithm
is 1 in the bit array. For instance if one returns 1
and the other returns 2, then the element is definitely
not in the set, because the value at index 2 is 0.
We need multiple hash algorithms to reduce the chances of a false positive given the space constraints, since we can't (or rather, don't want to) store the full hash-space (usually 232). However do not go overboard with too many hash algorithms- your filter will fill up too quickly, giving too many false positives.
It is also easy to see that adding and testing for elements has O(k) time complexity, where k is the number of hash algorithms, or k = |H|. Some tried and tested hash functions include the FNV family, Google's CityHash, and murmurhash used in BerkleyDB and other databases.
An unoptimised, simple example:
import fnvhash DEFAULT = [ fnvhash.fnv0_32, fnvhash.fnv1_32, fnvhash.fnv1a_32, ] class BloomFilter(object): def __init__(self, algorithms=DEFAULT, base=100): self.algorithms = tuple(algorithms)) self.storage = [False for _ in range(base)] self.base = base def __contains__(self, item): return all( self.storage[algo(item) % self.base] for algo in self.algorithms ) def add(self, item): for algo in self.algorithms: self.storage[algo(item) % self.base] = True
Note that the above example depends on the excellent
library. Be sure to have it
before you run the example.