Bloom Filters

Bloom filters are a useful technique invented a very long time ago and also invented a number of years ago by myself!

The premise is that you have an incoming stream of data items from a very large pool, think car registration numbers, domain names, credit card numbers etc and you need to see if they match a rogue list that is also so large it cannot sensibly be kept in memory. These days of course the solution is probably to make the problem go away by adding more memory, but we are looking for a more sensible solution.

The first obvious solution is to create a bitmap and populate bits according to a hash. You will get clashes in a finite length bitmap, but if things are sparsely populated then you only need to call out to the slow database when your test case finds a populated bit. If your bitmap is 1% occupied then 99% of the time you do not need to call the database. A big win!

The Bloom filter takes the creative leap forward, if you can have one bit map, then why not two of three or even more. If the second bitmap uses a different hash, then we have 1% of 1% of 99.99%of tests not needing to call the database.

Our example code shows how this works,if we have a finite bit of memory for our bitmap what will we find with one row of 100 bits, two rows of 50 bits and so on.

This in one output from the program.

Population Density 10 in 100 bits 10.000000%
Rows 1 map size 100 false positives 973 Percent 9.730000% clashes 0
Rows 2 map size 50 false positives 408 Percent 4.080000% clashes 0
Rows 4 map size 25 false positives 110 Percent 1.100000% clashes 9
Rows 5 map size 20 false positives 115 Percent 1.150000% clashes 10
Rows 10 map size 10 false positives 133 Percent 1.330000% clashes 35
Rows 20 map size 5 false positives 1041 Percent 10.410000% clashes 110
Rows 25 map size 4 false positives 1021 Percent 10.210000% clashes 158
Rows 50 map size 2 false positives 10000 Percent 100.000000% clashes 400
Rows 100 map size 1 false positives 10000 Percent 100.000000% clashes 900

I encourage you to compile with any c compiler and play with the numbers to get a feel for things. For example How does the occupancy level impact the results.

// bloom filter
// gcc bloom.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

// the constants you can play with
#define SIZE 100            // total bits used in filter these will be divided between rows
#define MAXROWS 100          // we will test permutations up to this number of rows
#define POPULATION_COUNT 10 // number of hits stored
#define TESTS 10000         // number of random items to test
#define STRING_LENGTH 10    // string length of the items we randomly generate

//
// For our simulation to work we need a reasonable proficient hashing
// algorithm that will produce a randomly different result for
// each row.
//
unsigned long hash(const char *str, unsigned long hash)
{
    unsigned int i = 0;

    int length = strlen(str);

    for (i = 0; i < length; ++str, ++i)
    {
        hash ^= ((hash << 5) + (*str) + (hash >> 2));
    }

    return hash;
}

// hashing algorithms tend to work better base on primes
// Exercise: Does it really matter?
static int IsPrime(int number)
{
    for (int i = 2; i < number; i++)
    {
        if (number % i == 0 && i != number)
            return 0;
    }
    return 1;
}

//
// create test/hit item
//
char *random_string()
{
    char *s = calloc(STRING_LENGTH + 1, 1);

    for (int i = 0; i < STRING_LENGTH; i++)
    {
        s[i] = 'a' + rand() % 26; // may as well be printable
    }
    return s;
}

int main(int argc, char *argv[])
{

    unsigned long seeds[MAXROWS];

    srand(time(0));

    // create unique random seeds for our hashing algorithm
    for (int i = 0; i < MAXROWS; i++)
    {
        int dup;
        do
        {
            dup = 0;
            seeds[i] = rand();
            if (!IsPrime(seeds[i])) // Only use primes?
                continue;
            // very unlikely we would duplicate, but lets check anyway
            for (int j = 0; j < i; j++)
            {
                if (seeds[i] == seeds[j])
                {
                    dup = 1;
                    break;
                }
            }
        } while (dup);
    }

    printf("Population Density %d in %d bits  %f%%\n", POPULATION_COUNT, SIZE, (float)100 * POPULATION_COUNT / (float)SIZE);
    //
    // Our bitmap size is SIZE, this will be split over one bitmap, then two bitmaps
    // all the way down to MAXROWS, you can take it all the way so a bitmap has only
    // bit which should show all false postitives
    //
    for (int tr = 1; tr <= MAXROWS; tr++)
    {
        if (SIZE % tr != 0)
        {
            continue; // only check for bitmaps that will use dead on SIZE bits overall
        }

        int clashes = 0;
        unsigned char **bitmaps; // hey I'm using chars for bits, this is example code not a real bloom filter

        // allocate the bitmap arrays
        bitmaps = malloc(tr * sizeof(char *));
        int mapsize = SIZE / tr;
        for (int r = 0; r < tr; r++)
        {
            bitmaps[r] = calloc(mapsize, 1);
        }

        //
        // populate with positives
        //
        for (int i = 0; i < POPULATION_COUNT; i++)
        {
            char *positive = random_string();

            //printf("%s\n",positive);

            // created populate each row
            for (int r = 0; r < tr; r++)
            {

                unsigned long offset = hash(positive, seeds[r]) % mapsize;
                if (bitmaps[r][offset] == 1)
                    clashes++;
                bitmaps[r][offset] = 1;
            }
            free(positive);
        }

        // do the bitmaps look sensible. e.g. no patterns
        /*for (int r = 0; r < tr; r++)
          {
               for (int i = 0; i<mapsize;i++)
               {
                     printf("%c",bitmaps[r][i]==0?'0':'1');
               }
                  printf("\n");
          }*/
        int false_matches = 0;

        // generate test cases
        for (int i = 0; i < TESTS; i++)
        {

            //char out[2000];
            char *sample = random_string();
            int alert = 1; // start off thinking it may be positive
                           // created populate each row
            //out[0] = '\0';
            for (int r = 0; r < tr; r++)
            {

                unsigned long offset = hash(sample, seeds[r]) % mapsize;
                //printf(" %ld ",offset);
                // sprintf(&out[strlen(out)],"(%ld %ld %ld)",offset,hash(sample, seeds[r]),seeds[r]);
                //sprintf(&out[strlen(out)],"(%ld)",offset);
                if (bitmaps[r][offset] == 0)
                {
                    alert = 0;
                    break;
                }
            }
            //if (alert)printf("%s %s\n",sample,out);
            false_matches += alert;
            free(sample);
        }
        printf("Rows %d map size %d false positives %d Percent %f%% clashes %d\n", tr, mapsize, false_matches, (float)100 * false_matches / (float)TESTS, clashes);

        // free the bitmap arrays
        for (int r = 0; r < tr; r++)
        {
            free(bitmaps[r]);
        }

        free(bitmaps);
    }
}

Search This Blog

Martin Walker

Bloom Filters

Comments

Post a Comment

Popular posts from this blog

battery claims

A reliable tuya authentication in node-red

General relationship manager