Array intersections - part 2

 To recap we have been looking at algorithms to find the intersection of two arrays e.g. the values they have in common.

We have moved from a simple but very poor way of doing things, to thinking more sensibly and coming up with an algorithm that worked about 25x faster.

However we have made the observation that if our smallest array is very small our method is instinctively inefficient as it involves a linear search of the large array.

The most common alternative to a linear search is a binary chop where we repeatedly divide the array in two searching the part where our element may still lie until we find it, or find it is not there.

However we have been talking about these arrays possibly being database indexes where we can assume numbers will be spread evenly across a range.

If our element is  towards the top of the range, then splitting down the middle seems inefficient. Hence let's leap straight to an interpolation search which will try and aim with a bit more accuracy.

//gcc -O interpolation.cpp -lstdc++
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "common.h"
#include <time.h>
#include <cassert>

using namespace std;

#define SPLITS 30
#define LONG_SPLITS 4
#define MAX_SHORT (LONG_LENGTH / 1000)

template <typename T>
int interpolation_search(T arr[], int size, T key)
{
int low = 0;
int high = size - 1;
int mid;

while ((arr[high] != arr[low]) && (key >= arr[low]) && (key <= arr[high]))
{
mid = low + ((key - arr[low]) * (high - low) / (arr[high] - arr[low]));

if (arr[mid] < key)
low = mid + 1;
else if (key < arr[mid])
high = mid - 1;
else
return mid;
}

if (key == arr[low])
return low;
else
return -1;
}

template <typename T>
int linear_scan(T array1[], int long_length, T array2[], int short_length, T array3[])
{
int finds2 = 0;
//printf("long_length %d short_length %d\n",long_length,short_length);
for (int s1 = 0, s2 = 0; s1 < long_length && s2 < short_length;)
{
//printf("compare array1[%d]=%d with array2[%d]=%d\n",s1,array1[s1],s2,array2[s2]);
if (array1[s1] == array2[s2])
{
array3[finds2++] = array1[s1];
s1++;
s2++;
}
else if (array1[s1] > array2[s2])
{
s2++;
}
else
{
s1++;
}
}
return finds2;
}

int main(int args, char **argc)
{
clock_t start, end;
double cpu_time_used1, cpu_time_used2;
for (int long_length = LONG_LENGTH/10; long_length < LONG_LENGTH; long_length += LONG_LENGTH / LONG_SPLITS)
{
uint32_t *array1 = GetRandomArray(long_length,long_length);

for (int short_length = 1; short_length < MAX_SHORT; short_length += MAX_SHORT / SPLITS)
{

uint32_t *array2 = GetRandomArray(short_length,long_length);
uint32_t *array3 = (uint32_t *)malloc(short_length * sizeof(uint32_t));

// time an interpolation search
start = clock();

int finds1 = 0;
int offset = 0;
for (int s2 = 0; s2 < short_length; s2++)
{
int found = interpolation_search(&array1[offset], long_length-offset, array2[s2]);
if (found != -1)
{
array3[finds1++] = array2[s2];
offset=found+1;
}
}
end = clock();
cpu_time_used1 = ((double)(end - start)) / CLOCKS_PER_SEC;

// repeat with a normal search
start = clock();
int finds2 = linear_scan(array1, long_length, array2, short_length, array3);

end = clock();
cpu_time_used2 = ((double)(end - start)) / CLOCKS_PER_SEC;
if(finds1 != finds2){
printf("Error %d != %d %d %d\n",finds1,finds2,short_length,long_length);
assert(false);
}

printf("short %d long %d Intersect %d linear time %f secs interpolation time %f secs\n", short_length, long_length, finds2, cpu_time_used2, cpu_time_used1);

free(array2);
free(array3);
}
free(array1);
printf("\n");
}
}

I have moved some code into functions using a C++ template for the array types. 

We look at assorted short and long array lengths and compare how long each takes.

You can import into a spreadsheet and graph the outcomes.

What you can find is that the interpolation search has a time:

X1 * short_length * log2(long_length)

Each doubling of the long_length adds an operation, our X1 will be smaller than for a basic binary swap, but the maths doesn't change. Even the fact that we reduce the array length we are searching by only searching the array from the point of the last find onwards doesn't alter this.

and the scanning method approximates:

X2 * long_length

We have to step through most of the long array regardless and so that is also quite instinctive.

 So sooner or later :

X1 * short_length * log2(long_length) > X2 * long_length

as plainly we deteriorate to:

 X1 * long_length * log2(long_length) > X2 * long_length

which will be true for some value of long_length

When we pitch short_length as 1 or 2 there is of course a massive variance in the time of the scanning search as if we have low random numbers in the short array the scan might termination after a low number of iterations.

Is this the end of the story on intersecting arrays, not by a long shot. So please look for part3.

 



Comments

Popular posts from this blog

A reliable tuya authentication in node-red

node-red a full home automation example

Arduino boards - what I wish I'd known when starting