Simple Count-Min-Sketch

Say you want to calculate the most popular hashtags from a live Twitter stream. A straightforward way of doing this is using a counter. When a tweet comes in, you look for the hashtags and you update your histogram[1]:

from collections import Counter
freqs = Counter()

for hashtag in hashtags:
    freqs[hashtag] += 1

Simple. For a small number (~10^6-ish) of tweets this is fine. However if we want to scale it up to billions of tweets[2], memory usage will be unacceptable because the space we need is roughly proportional to the number of unique hashtags, as well as the size of the integer associated with said hashtag.

The solution is of course to have a 'good enough' estimate, using a probablistic counter called Count-Min-Sketch. I think of it as a generalised version of the Bloom Filter. It takes in two parameters, the width (w) and depth (d).

0000
0000

It is just d arrays which contains w counters. For instance a CMS (count-min-sketch) with w = 4 and d = 2 will result in the arrays shown on the left. We also need to have d different hash functions for each array. Also we'll use zero-based counting for the examples that follow.

0100
0001

If we want to add an element v, for each of hash functions hi, assuming that the hash function returns integer values in the range [0, w - 1] we increment the counter at array i, given by hi(v). For instance if h0('abc') == 1 and h1('abc') == 3 we'll end up with the one shown on the left.

1100
0002

Say we added another value which results in a hash collision: h0('def') == 0, h1('def') == 3. This is where the min part of the algorithm comes in: if we want to get the frequency of 'abc', we must first hash it again, using our hash functions, and get the value of the counters.

In the abc example we will have values 1 and 2, since h0(v) == 1 and h1(v) == 2. We then take the minimum of those values as probably the correct frequency.

It is worth noting that the amount of frequencies we can remember is potentially infinite- but at some point the error rates become really high and it is meaningless, but we can do it with constant space of w × d × size-of(counter). However this datastructure comes with two problems: