How to count without counting
I’ve been messing about with a few different things lately, one of which is the HyperLoglog algorithm. Others have already written about this extensively, so I have nothing terribly new to contribute; I just wanted briefly discuss what it is and show some code.
While it’s a little zen sounding, HyperLogLog is a way of counting the number of unique items—say, words in a file—without actually counting them. That is, it gives a (rather good) estimate of the cardinality of the set of elements in a given multiset while taking up much less memory than instantiating the whole set (or a hashed version thereof), and it requires only a single pass through the data with some smaller post-processing on the end for a runtime of \( O(N) \).
The larger picture
I think HyperLogLog is wicked cool. More generally, though, I think it’s a great example of a type of algorithm that deserves more exploration, especially in this era of big data: algorithms that trade resource efficiency for approximate answers. A big part of what’s so cool about HyperLogLog is that the authors of this paper give a highly mathematical and probabilistic description of the possible error and then mitigate that in the final version of the algorithm. It’s very impressive.
For a (slapdash?) C++11 version of the final algorithm, head here.