Say you want to calculate the most popular hashtags from a live Twitter stream. A straightforward way of doing this is using a counter. When a tweet comes in, you look for the hashtags and you update your histogram[1]:

from collections import Counter freqs = Counter() for hashtag in hashtags: freqs[hashtag] += 1

Simple. For a small number (~10^6-ish) of tweets this is fine. However if we want to scale it up to billions of tweets[2], memory usage will be unacceptable because the space we need is roughly proportional to the number of unique hashtags, as well as the size of the integer associated with said hashtag.

The solution is of course to have a 'good enough' estimate, using
a probablistic counter called Count-Min-Sketch. I think of it as
a generalised version of the Bloom Filter. It takes in two
parameters, the width (**w**) and depth (**d**).

0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 |

It is just **d** arrays which contains **w** counters.
For instance a CMS (count-min-sketch) with **w = 4** and
**d = 2** will result in the arrays shown on the left.
We also need to have **d** different hash functions for
each array. Also we'll use zero-based counting for the examples
that follow.

0 | 1 | 0 | 0 |

0 | 0 | 0 | 1 |

If we want to add an element **v**, for each of hash functions
**h _{i}**, assuming that the hash function returns integer
values in the range

`h0('abc') == 1`

and `h1('abc') == 3`

we'll end up with the one shown
on the left.
1 | 1 | 0 | 0 |

0 | 0 | 0 | 2 |

Say we added another value which results in a hash collision:
`h0('def') == 0`

, `h1('def') == 3`

. This
is where the *min* part of the algorithm comes in: if we
want to get the frequency of `'abc'`

, we must first
hash it again, using our hash functions, and get the value of
the counters.

In the `abc`

example we will have values 1 and 2,
since `h0(v) == 1`

and `h1(v) == 2`

.
We then take the minimum of those values as probably the
correct frequency.

It is worth noting that the amount of frequencies we can remember
is potentially infinite- but at some point the error rates become
really high and it is meaningless, but we can do it with constant
space of **w × d × size-of(counter)**. However this datastructure
comes with two problems:

- We must accept a small, but predictable and in some cases non-negligible error rate.
- Like a bloom filter we cannot iterate over the elements in the counter. i.e. there's no way to get a list of elements the counter has 'seen'.