Bloom Filters Explained

A Bloom Filter is a 'set' that doesn't store any elements, and trades correctness for space. It is a probabalistic data structure- it tells whether something is probably within the set, or whether it definitely isn't.

It is typically implemented using a bit array- essentially a list of True or False values. The size of that list determines how resistant the bloom filter will be to false positives.

0 0 0 0 0 0 0 0 0 0

A Bloom filter will be (informally) defined by a set of hash algorithms (not necessarily cryptographic) H and the size of the bit vector N (in the above example it is 10). It needs to implement two operations, add and test.

When we add an element, we set the result of hashing the element using each hash algorithm from H in the bit array to 1. For example if one algorithm returns 1 and the other returns 3, the bit array would be:

0 1 0 1 0 0 0 0 0 0

Implementation detail: one can always limit the output of the hash algorithm to within 0 and N-1 (inclusive) by 'mod'-ing the result to base N (the % operator in most languages). For instance the FNV family of hash algorithms returns a 32-bit integer. We can just take that integer mod 10 for our bit example.

When we test an element, we check if all the results of hashing the element using each algorithm is 1 in the bit array. For instance if one returns 1 and the other returns 2, then the element is definitely not in the set, because the value at index 2 is 0.

We need multiple hash algorithms to reduce the chances of a false positive given the space constraints, since we can't (or rather, don't want to) store the full hash-space (usually 232). However do not go overboard with too many hash algorithms- your filter will fill up too quickly, giving too many false positives.

It is also easy to see that adding and testing for elements has O(k) time complexity, where k is the number of hash algorithms, or k = |H|. Some tried and tested hash functions include the FNV family, Google's CityHash, and murmurhash used in BerkleyDB and other databases.

An unoptimised, simple example:

import fnvhash

DEFAULT = [
    fnvhash.fnv0_32,
    fnvhash.fnv1_32,
    fnvhash.fnv1a_32,
]

class BloomFilter(object):
    def __init__(self, algorithms=DEFAULT, base=100):
        self.algorithms = tuple(algorithms))
        self.storage = [False for _ in range(base)]
        self.base = base

    def __contains__(self, item):
        return all(
            self.storage[algo(item) % self.base]
            for algo in self.algorithms
            )

    def add(self, item):
        for algo in self.algorithms:
            self.storage[algo(item) % self.base] = True

Note that the above example depends on the excellent py-fnvhash library. Be sure to have it pip install-ed before you run the example.