Array Intersections - part 4

 In our last visit to array intersections we looked at using "simd" instructions to speed up the process of array intersections.

It's a quite well documented technique and one I have used successfully myself.

Things didn't go quite to plan. The code worked but was slower than the basic technique of scanning two arrays.

Now admittedly I had simplified ever so slightly I used:

if ((mask & 1) > 0)
{
array3[count++] = y[posy];
}
if ((mask & 2) > 0)
{
array3[count++] = y[posy + 1];
}
if ((mask & 4) > 0)
{
array3[count++] = y[posy + 2];
}
if ((mask & 8) > 0)
{
array3[count++] = y[posy + 3];
}

to copy the elements into the output array when they match according to the mask we found using the simd instructions.

The real world implementations I have seen use more simd instructions for this task.

This is usually done, not with a case statement but an array of masks, but curiously my gcc compiler is determined that _mm_shuffle_epi32(value, mask) sees mask as a constant discernable at compile time. 

So: simd_intersect_mk2.cpp

#define COPYOUT(a,f,m) _mm_storeu_si128((__m128i *)a, _mm_shuffle_epi32(f,m))
switch (mask)
{
case 1:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(1, 2, 3, 0));
break;
case 2:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(1, 2, 3, 1));
break;
case 3:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(2, 3, 1, 0));
break;
case 4:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(1, 2, 3, 2));
break;
case 5:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(1, 3, 2, 0));
break;
case 6:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(3, 0, 2, 1));
break;
case 7:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(3, 2, 1, 0));
break;
case 8:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(0, 1, 2, 3));
break;
case 9:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 2, 3, 0));
break;
case 10:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(0, 2, 3, 1));
break;
case 11:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(2, 3, 1, 0));
break;
case 12:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 2, 3, 2));
break;
case 13:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 3, 2, 0));
break;
case 14:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(0, 3, 2, 1));
break;
case 15:
COPYOUT(&array3[count],a2, _MM_SHUFFLE(3, 2, 1, 0));
break;
}

count += mask_count[mask]; // a number of elements is a weight of the mask



This mouthful shuffles the integers to be output to the lowest element of the array.

The numbers start to be understandable by looking at 15 which is to add all the values without changing order and then at 10 where we want positions 1 and 3 written out.

Running this makes for a very small improvement, but we might as well go a bit further, as we have a case statement we are being surely a bit silly with the cases where we just want to write out elements that are in contiguous order. 

So: simd_intersect_mk3.cpp

#define COPYOUT(a, f, m) _mm_storeu_si128((__m128i *)a, _mm_shuffle_epi32(f, m))
switch (mask)
{
case 1:
array3[count] = y[posy];
break;
case 2:
array3[count] = y[posy + 1];
break;
case 3:
memcpy((void *)&array3[count], (void *)&y[posy], 8);
break;
case 4:
array3[count] = y[posy + 2];
break;
case 5:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 3, 2, 0));
break;
case 6:
memcpy((void *)&array3[count], (void *)&y[posy + 1], 8);
break;
case 7:
memcpy((void *)&array3[count], (void *)&y[posy], 12);
break;
case 8:
array3[count] = y[posy + 3];
break;
case 9:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 2, 3, 0));
break;
case 10:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(0, 2, 3, 1));
break;
case 11:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(2, 3, 1, 0));
break;
case 12:
memcpy((void *)&array3[count], (void *)&y[posy + 2], 8);
break;
case 13:
COPYOUT(&array3[count], a2, _MM_SHUFFLE(1, 3, 2, 0));
break;
case 14:
memcpy((void *)&array3[count], (void *)&y[posy + 1], 12);
break;
case 15:
memcpy((void *)&array3[count], (void *)&y[posy], 16);
break;
}

count += mask_count[mask]; // a number of elements is a weight of the mask


 Might seem to be cleverer, but is in fact slower. Shows something about the speed of the simd instructions even if they are not playing well for us.

What might this mystery be? The truth is in the horrors of memory alignment, we are dealing in 32bit values on 64bits systems.

In the code we have:

// the array with the lowest end element can move forward 4
if (y[posy + 3] > x[posx + 3])
{
posx += 4;
// the y array elements were not shuffled
// so we know the position of the last match
posy += move_forward[mask];
}
else
{
posy += 4;
// we do notknow the position of the last matched element in the
// x array but it we had a certain number of matches
// me must be able to move forward by at least the count of
// the bits in the mask
posx += mask_count[mask];
}


Which moves us forward through both arrays, one moves 128bytes forward the other will move a multiple of 32bytes forward breaking 64bit alignment. 

If we take the hit of not moving forward on the other array.

simd_interesect_mk4.cpp:
 

// the array with the lowest end element can move forward 4
if (y[posy + 3] > x[posx + 3])
{
posx += 4;
}
else
{
posy += 4;
}


We finally see an output:

Total linear time 0.832165 seconds, total simd time 0.390565 ratio 0.469336

Which makes the simd time significantly faster.

Can we improve further?

I made a series of attempts.

mk1 Total linear time 0.580798 seconds, total simd time 0.987996 ratio 1.701101

The original unaligned first try.

mk2 Total linear time 0.554349 seconds, total simd time 0.907844 ratio 1.637676

Use simd instructions to copy the intersections to the output array. A noticable improvement.


mk3 Total linear time 0.609809 seconds, total simd time 1.032354 ratio 1.692914

Try and pick and choose when to use simd for the intersection copy. Well that did not work.

mk4 Total linear time 0.832165 seconds, total simd time 0.390565 ratio 0.469336

Simply solve the alignment issue. We have lift off.


mk5 Total linear time 0.763136 seconds, total simd time 0.678215 ratio 0.888721

Set up the first shuffle values outside the loop, and only redo the needed values when we shift to the next element. A surprising failure.

mk6 Total linear time 0.753321 seconds, total simd time 0.684975 ratio 0.909274

Try and push both arrays forward but respecting 64 bit alignment.


mk7 Total linear time 0.670275 seconds, total simd time 0.534649 ratio 0.797656

revisit simple output of values.


mk8 Total linear time 0.558853 seconds, total simd time 0.426161 ratio 0.762564

try and cherry pick output using simd for anything more than one value.









Comments

Popular posts from this blog

A reliable tuya authentication in node-red

Arduino boards - what I wish I'd known when starting

node-red a full home automation example