A Bloom Filter is a 'set' that doesn't store any elements,
and trades correctness for space. It is a probabalistic
data structure- it tells whether something is *probably
within* the set, or whether it definitely isn't.

It is typically implemented using a bit array- essentially a list of True or False values. The size of that list determines how resistant the bloom filter will be to false positives.

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

A Bloom filter will be (informally) defined by a set of hash
algorithms (not necessarily cryptographic) **H** and the
size of the bit vector **N** (in the above example it is 10).
It needs to implement two operations, `add`

and
`test`

.

When we `add`

an element, we set the result of
hashing the element using each hash algorithm from **H**
in the bit array to 1. For example if one algorithm returns 1
and the other returns 3, the bit array would be:

0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

**Implementation detail:** one can always limit the output
of the hash algorithm to within 0 and **N-1** (inclusive)
by 'mod'-ing the result to base **N** (the `%`

operator in most languages). For instance the FNV family of
hash algorithms returns a 32-bit integer. We can just take
that integer mod 10 for our bit example.

When we `test`

an element, we check if all
the results of hashing the element using each algorithm
is 1 in the bit array. For instance if one returns 1
and the other returns 2, then the element is definitely
not in the set, because the value at index 2 is 0.

We need multiple hash algorithms to reduce the chances of a
false positive given the space constraints, since we can't
(or rather, don't want to) store the full hash-space (usually
2^{32}). However do not go overboard with too many
hash algorithms- your filter will fill up too quickly,
giving too many false positives.

It is also easy to see that adding and testing for elements
has **O(k)** time complexity, where **k** is the number
of hash algorithms, or **k = |H|**. Some tried and tested
hash functions include
the FNV family,
Google's CityHash,
and murmurhash
used in BerkleyDB and other databases.

An unoptimised, simple example:

import fnvhash DEFAULT = [ fnvhash.fnv0_32, fnvhash.fnv1_32, fnvhash.fnv1a_32, ] class BloomFilter(object): def __init__(self, algorithms=DEFAULT, base=100): self.algorithms = tuple(algorithms)) self.storage = [False for _ in range(base)] self.base = base def __contains__(self, item): return all( self.storage[algo(item) % self.base] for algo in self.algorithms ) def add(self, item): for algo in self.algorithms: self.storage[algo(item) % self.base] = True

Note that the above example depends on the excellent
py-fnvhash
library. Be sure to have it `pip install`

-ed
before you run the example.