Say you want to calculate the most popular hashtags in a live Twitter stream. A straightforward way to do this is with a counter: when a tweet comes in, you extract its hashtags and update your histogram:
from collections import Counter

freqs = Counter()
for hashtag in hashtags:
    freqs[hashtag] += 1
Simple. For a small number of tweets (on the order of 10^6) this is fine. However, if we scale up to billions of tweets, memory usage becomes unacceptable: the space we need grows roughly in proportion to the number of unique hashtags, times the size of the counter associated with each hashtag.
The solution, of course, is to settle for a 'good enough' estimate, using a probabilistic counter called the Count-Min Sketch. I think of it as a generalised version of the Bloom filter. It takes two parameters: the width (w) and the depth (d).
It is just d arrays, each containing w counters. For instance, a CMS (Count-Min Sketch) with w = 4 and d = 2 consists of two arrays of four counters each, all initialised to zero. We also need a different hash function for each of the d arrays. We'll use zero-based indexing in the examples that follow.
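A minimal sketch of that structure in Python. The salted-md5 hash family here is my own assumption for illustration; any d independent hash functions mapping into [0, w - 1] would do:

```python
import hashlib

w, d = 4, 2  # width and depth, matching the example above

# d arrays of w counters, all starting at zero
counters = [[0] * w for _ in range(d)]

def h(i, value):
    # Hash function for array i: md5 salted with the array index,
    # reduced modulo w so the result lands in [0, w - 1]
    digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
    return int(digest, 16) % w
```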
If we want to add an element v, then for each hash function hi (assuming each hash function returns integer values in the range [0, w - 1]) we increment the counter in array i at the index given by hi(v). For instance, if h0('abc') == 1 and h1('abc') == 3, we increment array 0 at index 1 and array 1 at index 3.
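That update step can be sketched as follows (again with a hypothetical salted-md5 hash family, so the concrete indices won't match the h0('abc') == 1 example in the text):

```python
import hashlib

w, d = 4, 2
counters = [[0] * w for _ in range(d)]

def h(i, value):
    # Hypothetical hash family: md5 salted with the array index i
    return int(hashlib.md5(f"{i}:{value}".encode()).hexdigest(), 16) % w

def add(value):
    # Increment exactly one counter per array, at the index
    # given by that array's hash function
    for i in range(d):
        counters[i][h(i, value)] += 1

add('abc')  # one counter in each of the d arrays goes up by one
```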
Say we added another value which results in a hash collision: h0('def') == 0 and h1('def') == 3. Array 1 now holds the value 2 at index 3, even though we have only seen each element once. This is where the min part of the algorithm comes in: if we want to get the frequency of 'abc', we hash it again using our hash functions and read the counters at the resulting indices. For the abc example we will read the values 1 (at index h0('abc') == 1 in array 0) and 2 (at index h1('abc') == 3 in array 1). We then take the minimum of those values, 1, as the most likely frequency: collisions can only inflate a counter, so the smallest one is the tightest estimate.
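Putting the two halves together, the query side can be sketched like this (salted md5 again standing in for the d hash functions, so the exact collision of the text's 'def' example won't necessarily be reproduced):

```python
import hashlib

w, d = 4, 2
counters = [[0] * w for _ in range(d)]

def h(i, value):
    # Hypothetical hash family: md5 salted with the array index i
    return int(hashlib.md5(f"{i}:{value}".encode()).hexdigest(), 16) % w

def add(value):
    for i in range(d):
        counters[i][h(i, value)] += 1

def query(value):
    # Every counter a value touches is an over-estimate (collisions
    # only ever add), so the minimum across arrays is the answer
    return min(counters[i][h(i, value)] for i in range(d))

add('abc')
add('def')
# query('abc') returns 1 unless 'def' collides with it in every array
```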
It is worth noting that the number of distinct elements we can count is potentially unbounded (although at some point the error rate becomes so high that the estimates are meaningless), and we do it in constant space of w × d × sizeof(counter). However, this data structure comes with two problems: