Posts

Showing posts with the label uniques

HyperLogLog or estimating the cardinality of a multiset

 That's a mouthful of a title already. Let's break down firstly what it means? Take a log of all mobile numbers called in a day? how many numbers were unique? So store the numbers in a database and have a calls_made field? the number of rows is the cardinality right? Well yes right but that's a pretty big dataset and we just want one little bit of data and we're not even keen on invading privacy at the level. The same goes for web requests there are lots of permutations in a request, referrer, ip address, device, country, error codes where our interest lies in the number of distinct events that happen.   Estimating cardinality distributions in network traffic Is an example. So what is HyperLogLog, in essence a way of doing this with a heck of a lot less memory. The HyperLogLog algorithm can estimate cardinalities of > 10 9 with a typical accuracy of 2%, using 1.5 kB of memory.  How does this minor miracle occur? It works with probabilities. To pick an ...