HyperLogLog or estimating the cardinality of a multiset
That's a mouthful of a title already. Let's break down firstly what it means?
Take a log of all mobile numbers called in a day? how many numbers were unique?
So store the numbers in a database and have a calls_made field? the number of rows is the cardinality right?
Well yes right but that's a pretty big dataset and we just want one little bit of data and we're not even keen on invading privacy at the level.
The same goes for web requests there are lots of permutations in a request, referrer, ip address, device, country, error codes where our interest lies in the number of distinct events that happen.
Estimating cardinality distributions in network traffic
Is an example.
So what is HyperLogLog, in essence a way of doing this with a heck of a lot less memory. The HyperLogLog algorithm can estimate cardinalities of > 109 with a typical accuracy of 2%, using 1.5 kB of memory.
How does this minor miracle occur?
It works with probabilities. To pick an analogy if I told you I had thrown a coin 8x and got all heads you'd guess I'd tried for this a lot more than once.
We can do something with a datum that is analogous to tossing a coin. When we take a good hash of a datum the top bit has a 50/50 chance of being 0 and so does each successive bit.
If we take the hashes a load of unique items and record the highest number of leading zero bits as 8 then once again we would guess we had quite a lot of unique items.
But it would be a bit of a guess as freak results do occur.
To find out true odds when tossing coins, the trick is to repeat the experiment several times and use the law of averages.
HLH works on the same sort of principle, to make averages work in our favour we split our incoming data evenly between buckets by taking a very good hash of the number and calling say the lowest 8 bits an index into the buckets.
We then count the leading binary zeros in the hash and if it exceed the number previously in the bucket store that number in the bucket. Higher numbers of zeros are of course increasingly unlikely, this is our coin toss!
Any individual bucket might end up with a bit of a freak result. But over a load of buckets an average will happen that especially with a large dataset will make sense. The higher the number of average leading zeros the more unique entries we are likely to have put into the HLH. Therefore we can estimate that number if only some clever clogs has worked out the math involved? and of course they have:
The math if you really must know
So let's create a little test program. I think our language of choice here is C++, might actually be useful as more than a test that way.
I like to keep programs down to what can be seen in the blog, but there's just too many lines of code for this one, so here's the github link:
We will test by create a random data set with a known cardinality.
When we run this we get the following:
Sample 400 Unique 100 Estimated Unique 103
Sample 400 Unique 100 Estimated Unique 96
Union Estimated Unique 120
Our two sets on random data have given not too great answers when using 7 bits as our bucket size, but the union of the two sets is by a stroke of chance spot on.
The union operation is an interesting one. Since we are recording most leading zeros in each bucket, when merging buckets we just take each corresponding bucket with the bigger number. To state the obvious only datasets using the same number of buckets can work like this.
Please play with the constants at the top of the file to see different results.
Footnote
There is more that can be said on HyperLogLog, the figures my version gives are slightly less accurate than possible as there is some extra maths that can be applied to cope with distortions caused by hash collisions.
As with a lot of algorithms which are lossy, you have to ask yourself what are you really seeking. If I want to see one combination gets a million unique values and another half that number I really don't care that much about extreme accuracy. If I really care about accuracy then maybe I don't want HyperLogLog anyway as perhaps my range of unique values only runs into the hundreds?
Comments
Post a Comment