Showing posts from August, 2021

Making a "map" of things

Computers are fast these days, fast enough to have allowed scripting languages like PHP to be viable. This allows for lazy programming, so I thought it might be interesting to revisit a problem I encountered many years ago. The problem and solution are simplified here, so never mind if you don't see why it is solved this way; it makes sense in the actual context. We have a stream of queries keeping counts of events in a database. The queries are all of the form: UPDATE table SET total=total+1 WHERE key = blah This stream is too much for the database to cope with. So instead of loads of queries we build a string map of the queries, adding 1 to a total against a query each time we see a duplicate. So the query becomes: UPDATE table SET total=total+__TOTAL__ WHERE key = blah Every so often we iterate the map, substitute __TOTAL__ and start again. In practice this reduced load to about 10% of what it was. So let's program this naively using std::map  // gcc -O ordered_map.cpp -lstdc++ #include <

Making a "hash" of things

 I seem to talk a lot about "hashes" and it's easy to assume that people know all about them. Put simply, a hash function takes an input and outputs a numeric value. The output value can be used in checksums, to speed up data storage and retrieval, and for a million other purposes. Hash functions are varied and may have special properties. The SHA-1 hash, for example, produces a long 160-bit hash value that is so unlikely to be the same for different inputs that it is used to index all revisions of all files in "git" repositories, on the assumption that no two files or their revisions will ever have the same SHA-1 hash code. SHA-1 is of course a little complex. A simpler use of a hash is if we are storing a series of values and want to see if a particular value is stored. If we put things in an unordered array, all we can do is look at each element of the array in turn until we either find the value or have looked at all elements. This would be very slow. An alternative would be keep

HyperLogLog or estimating the cardinality of a multiset

 That's a mouthful of a title already, so let's first break down what it means. Take a log of all mobile numbers called in a day: how many numbers were unique? We could store the numbers in a database with a calls_made field, and the number of rows is the cardinality, right? Well yes, but that's a pretty big dataset when we just want one little piece of data, and we're not even keen on invading privacy at that level. The same goes for web requests: there are lots of permutations in a request (referrer, IP address, device, country, error codes) where our interest lies in the number of distinct events that happen. Estimating cardinality distributions in network traffic is an example. So what is HyperLogLog? In essence, a way of doing this with a heck of a lot less memory. The HyperLogLog algorithm can estimate cardinalities of > 10^9 with a typical accuracy of 2%, using 1.5 kB of memory.  How does this minor miracle occur? It works with probabilities. To pick an analogy, if I

Learning by example code

 There are any number of programming resources on the web and this blog is not seeking to compete with any of them. However, there are also loads of subjects that are interesting, and there is fun to be had and learning to be done if they are made more accessible. I like resources that try to keep things simple, and as a programmer I like to run code. I was prompted to start this blog by a site which got the maths wrong on a topic I wanted to understand. It didn't look right on the page, so I bit the bullet and tried to code the thing. Sure enough, it was a strange typo that totally transformed what would happen.  So this blog will try to create some programming diversions. The language will often be C, as C is kind of universally present, but that will not always be the case: you can't show, say, a golang technique in C! The emphasis is on code you can understand, not on creating library functions for actual use. Though in some cases functions that are usable in your own

Bloom Filters

Bloom filters are a useful technique invented a very long time ago, and also invented a number of years ago by myself! The premise is that you have an incoming stream of data items drawn from a very large pool (think car registration numbers, domain names, credit card numbers, etc.) and you need to see if they match a rogue list that is also so large it cannot sensibly be kept in memory. These days of course the solution is probably to make the problem go away by adding more memory, but we are looking for a more sensible solution. The first obvious solution is to create a bitmap and populate bits according to a hash. You will get clashes in a finite-length bitmap, but if things are sparsely populated then you only need to call out to the slow database when your test case finds a populated bit. If your bitmap is 1% occupied then 99% of the time you do not need to call the database. A big win! The Bloom filter takes the creative leap forward: if you can have one bitmap, then why not two or three