Array intersections - part 1

When a database performs a query, there is every likelihood that sets of matching records find their way into multiple arrays of record numbers, which then need filtering against each other to arrive at the final set of records. I myself once wrote a database that used this method almost exclusively for its indexes.

Intersecting the common numbers in two or more arrays might sound a little trivial, but rather a lot of papers have been written on how to do it optimally.

As such, this is going to be a series of articles exploring the subject. Along with speed, one of the things we will be covering is memory usage.

We will start off badly! Though we will still use -O for compiler optimisation; it's unfair not to give the compiler at least a chance to improve on whatever rubbish we have thrown at it.

Here's the github link
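
The code that follows relies on a small common.h header from the repo that isn't reproduced here. A minimal sketch consistent with how it is used (GetRandomArray returning a sorted array, LONG_LENGTH of 1000000 to match the output below) might look like this; the generator and value range are my guesses, and the repo version may well differ:

//common.h - hypothetical sketch; the repo version may differ
#ifndef COMMON_H
#define COMMON_H
#include <stdint.h>
#include <stdlib.h>
#include <algorithm>

#define LONG_LENGTH 1000000

// Fill an array with random values and sort it ascending - the merge
// version later in the article relies on the arrays being ordered.
static uint32_t *GetRandomArray(int length)
{
    uint32_t *array = (uint32_t *)malloc(length * sizeof(uint32_t));
    for (int i = 0; i < length; i++)
        array[i] = (uint32_t)(rand() % (20 * length)); // guessed range
    std::sort(array, array + length);
    return array;
}

#endif

With that assumed, on to the deliberately bad first attempt.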

//gcc -O bad.cpp -lstdc++
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <set>
#include <iterator>
#include "common.h"
#include <time.h>

using namespace std;

int main(int argc, char **argv)
{
    clock_t start, end;
    double cpu_time_used;
    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    // we want to time the actual work part
    start = clock();
    set<uint32_t> set1;

    // build a set from the first array, one insert per element
    for (int i = 0; i < LONG_LENGTH; i++)
        set1.insert(array1[i]);

    // probe the set with every element of the second array
    int finds = 0;
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if (set1.find(array2[i]) != set1.end())
            array3[finds++] = array2[i]; // store the matched value
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Total %d Intersect %d time %f seconds\n", LONG_LENGTH, finds, cpu_time_used);
}

Running this I get:

Total 1000000 Intersect 48836 time 1.252631 seconds

Is that good or bad? Well, at this point we technically don't know. However, if we do:

memusage ./a.out

We will see:

Total 1000000 Intersect 48836 time 0.459821 seconds

Memory usage summary: heap total: 52073728, heap peak: 52073728, stack peak: 2256
        total calls   total memory   failed calls
malloc|    1000005       52073728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
calloc|          0              0              0
  free|    1000000       40000000
Histogram for block sizes:
  32-47        1000000  99% ==================================================
1024-1039            1  <1%  
  large              4  <1%


Now that does look bad: we have made a million memory allocation calls. Why so many? My first thought was that it is largely because set1 has no idea how large it will grow to be. We, however, do know, so it's pretty inexcusable for our code to be quite this awful. Let's make a small change.

//gcc -O stillbad.cpp -lstdc++
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <set>
#include <iterator>
#include "common.h"
#include <time.h>

using namespace std;

int main(int argc, char **argv)
{
    clock_t start, end;
    double cpu_time_used;
    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    // we want to time the actual work part
    start = clock();
    // build the set from the whole array in one go via the range constructor
    set<uint32_t> set1(array1, array1 + LONG_LENGTH);
    int finds = 0;
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if (set1.find(array2[i]) != set1.end())
            array3[finds++] = array2[i]; // store the matched value
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Total %d Intersect %d time %f seconds\n", LONG_LENGTH, finds, cpu_time_used);
}

We are now using a constructor that takes the whole array and hence, we might think, could preallocate the space in one go.

This time:

memusage ./a.out

Total 1000000 Intersect 48836 time 0.220658 seconds

Memory usage summary: heap total: 52073728, heap peak: 52073728, stack peak: 2272
        total calls   total memory   failed calls
malloc|    1000005       52073728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
calloc|          0              0              0
  free|    1000000       40000000
Histogram for block sizes:
  32-47        1000000  99% ==================================================
1024-1039            1  <1%  
  large              4  <1%


We seem to have halved the time, but we are still a memory monster! If we look closer at what memusage has told us, the histogram tells us the million allocations are all for small blocks of memory, so plainly "set" has something complicated going on. In fact, std::set is typically implemented as a red-black tree, which allocates a separate node for every element, so no constructor can preallocate it as a single block.
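
If you want to see that per-node behaviour without memusage, a counting allocator makes it visible. This is a sketch of mine, not from the post; the one-allocation-per-element result is what typical implementations such as libstdc++ produce:

//gcc -O countalloc.cpp -lstdc++  (hypothetical file, not in the repo)
#include <set>
#include <cstdio>
#include <cstdlib>

static long g_allocs = 0; // counts every allocation the set makes

// minimal counting allocator wrapped around malloc/free
template <class T>
struct CountingAlloc {
    using value_type = T;
    CountingAlloc() = default;
    template <class U> CountingAlloc(const CountingAlloc<U> &) {}
    T *allocate(size_t n) { g_allocs++; return (T *)malloc(n * sizeof(T)); }
    void deallocate(T *p, size_t) { free(p); }
    template <class U> bool operator==(const CountingAlloc<U> &) const { return true; }
    template <class U> bool operator!=(const CountingAlloc<U> &) const { return false; }
};

int main()
{
    std::set<int, std::less<int>, CountingAlloc<int>> s;
    for (int i = 0; i < 1000; i++)
        s.insert(i);
    // expect 1000 allocations: one tree node per element
    printf("%ld allocations for %zu elements\n", g_allocs, s.size());
}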

But seriously, why are we using "set" at all? It's pretty lazy programming. So let's start being sensible.

//gcc -O sensible.cpp -lstdc++
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "common.h"
#include <time.h>

using namespace std;

int main(int argc, char **argv)
{
    clock_t start, end;
    double cpu_time_used;
    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    // we want to time the actual work part
    start = clock();
    int finds = 0;
    // walk both sorted arrays in step, advancing whichever cursor
    // points at the smaller value
    for (int s1 = 0, s2 = 0; s1 < LONG_LENGTH && s2 < LONG_LENGTH; )
    {
        if (array1[s1] == array2[s2]) {
            array3[finds++] = array1[s1];
            s1++;
            s2++;
        } else if (array1[s1] > array2[s2]) {
            s2++;
        } else {
            s1++;
        }
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Total %d Intersect %d time %f seconds\n", LONG_LENGTH, finds, cpu_time_used);
}

This time we bother to think for ourselves and simply step through the two arrays, which we know are ordered.

Total 1000000 Intersect 48836 time 0.009072 seconds

Memory usage summary: heap total: 12001024, heap peak: 12001024, stack peak: 2160
        total calls   total memory   failed calls
malloc|          4       12001024              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
calloc|          0              0              0
  free|          0              0
Histogram for block sizes:
1024-1039            1  25% ================
  large              3  75% ==================================================

Not surprisingly, there is no comparison. We use a fraction of the memory, hardly allocate at all, and are roughly 25x faster.
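
As an aside, the standard library can express this same two-pointer walk for us: std::set_intersection from <algorithm> works on plain sorted arrays and writes the matches to an output pointer. A minimal sketch (a file of mine, not in the repo, and with performance untested here) mirroring the program above:

//gcc -O stl.cpp -lstdc++  (hypothetical file, not in the repo)
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <algorithm>
#include "common.h"
#include <time.h>

int main(int argc, char **argv)
{
    clock_t start, end;
    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    start = clock();
    // std::set_intersection performs the same merge walk over the two
    // sorted ranges and writes the common values to array3
    uint32_t *out_end = std::set_intersection(array1, array1 + LONG_LENGTH,
                                              array2, array2 + LONG_LENGTH,
                                              array3);
    int finds = (int)(out_end - array3);
    end = clock();
    double cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Total %d Intersect %d time %f seconds\n", LONG_LENGTH, finds, cpu_time_used);
}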

So is this the end of the story? Not by a long way if you need things truly optimal, but that is where profiling comes in. Though even before profiling, you might take the time to consider your actual input arrays.

We have thus far worked with two arrays of similar, long length. But what if you know that one array is very large and the other very small? Taking things to an extreme, if there is just one value in the second array then our merge loop's execution time will on average be proportional to half the length of the first array: a clear linear relationship between length and time. For more than one value I would expect the time to be some function of the two lengths, and worse than linear.
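
To make that concrete, one obvious direction for the lopsided case is to binary search the long array for each element of the short one, costing m log n rather than m + n. Here is a sketch of mine (the function names are hypothetical, and whether part 2 takes this route is another matter):

//gcc -O lopsided.cpp -lstdc++  (hypothetical file, not in the repo)
#include <stdint.h>
#include <stdio.h>

// standard binary search over a sorted uint32_t array
static int contains(const uint32_t *a, int len, uint32_t key)
{
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return 1;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return 0;
}

// m log n lookups instead of an m + n merge: a win when m << n
static int intersect_small_large(const uint32_t *small_arr, int m,
                                 const uint32_t *large_arr, int n,
                                 uint32_t *out)
{
    int finds = 0;
    for (int i = 0; i < m; i++)
        if (contains(large_arr, n, small_arr[i]))
            out[finds++] = small_arr[i];
    return finds;
}

int main()
{
    uint32_t large_arr[] = {2, 4, 6, 8, 10, 12, 14, 16};
    uint32_t small_arr[] = {4, 9, 16};
    uint32_t out[3];
    int finds = intersect_small_large(small_arr, 3, large_arr, 8, out);
    printf("Intersect %d\n", finds); // expect 2 (matching 4 and 16)
}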

Part 2 will address improving on this.







