Let's unroll a loop

Loop unrolling is an optimisation that has had a small place in my heart ever since I worked at a Dilbertesque company with so many management layers that, on one project, I was a couple of years in before hearing that some guy I had never heard of, and who had never bothered to meet the troops, was actually in charge!

But I digress. It was on that sort of project where promises had been made to a client about how fast a mathematical model was going to run. Obviously it didn't, and they thought they were going to need to lose a significant amount of dosh buying better kit for the client.

I profiled the code, found the hotspot in a loop and, having looked at the generated assembler, saw that no unrolling was being done. Duly unrolled, the code was, at least as far as speed goes, up to spec.

Some time later a man in a grey suit actually deigned to find me in my cubicle and awkwardly say thanks.  This was extremely novel at the time.

So here's an example which isn't too far off what I did.

rolling.cpp

github link

// gcc -O unrolling.cpp -lstdc++
#include <cstdio>
#include <stdint.h>
#include <cstring>
#include <cassert>
#include <cstdlib>
#include <time.h>
#define VERIFY
using namespace std;

#define LONG_LENGTH 100000000

static uint32_t getRandom(int low, int high)
{
    return rand() % (high + 1 - low) + low;
}

// array of random numbers
uint32_t *GetRandomArray(uint32_t length)
{
    uint32_t *array = (uint32_t *)malloc(length * sizeof(uint32_t));

    for (uint32_t i = 0; i < length; i++)
    {
        array[i] = getRandom(1, length);
    }

    return array;
}

int main(int argc, char **argv)
{
    clock_t start, end;
    double cpu_time_used1, cpu_time_used2;

    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));
    uint32_t *array4 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    // plain loop: one conditional test per iteration
    start = clock();
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if ((i % 2) == 0)
        {
            array3[i] = array1[i] * array2[i];
        }
        else
        {
            array3[i] = array1[i] + array2[i];
        }
    }
    end = clock();
    cpu_time_used1 = ((double)(end - start)) / CLOCKS_PER_SEC;

    // repeat unrolled: two elements per pass, no conditional
    start = clock();

    int i;
    for (i = 0; i < LONG_LENGTH - 1; i += 2)
    {
        array4[i] = array1[i] * array2[i];

        array4[i + 1] = array1[i + 1] + array2[i + 1];
    }
    // tail loop for any leftover elements
    for (; i < LONG_LENGTH; i++)
    {
        if ((i % 2) == 0)
        {
            array4[i] = array1[i] * array2[i];
        }
        else
        {
            array4[i] = array1[i] + array2[i];
        }
    }

    end = clock();
    cpu_time_used2 = ((double)(end - start)) / CLOCKS_PER_SEC;
#ifdef VERIFY
    bool bad = false;
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if (array3[i] != array4[i])
        {
            bad = true;
            break;
        }
    }
    if (bad)
    {
        printf("Error arrays do not match\n");
        assert(false);
    }
#endif
    printf("loop time %f secs unrolled %f secs %f\n", cpu_time_used1, cpu_time_used2, cpu_time_used2 / cpu_time_used1);
    free(array1);
    free(array2);
    free(array3);
    free(array4);
}

A typical output is:

loop time 0.280279 secs unrolled 0.194381 secs 0.693527

That shows a healthy 30% speed-up, simply from taking a conditional expression out of the loop and performing two iterations at once. Of course, we do have to deal, a little messily, with the potential tail end of the array.
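
As an aside, here is a minimal sketch of one way to keep that tail handling a bit tidier. This is my own illustration, not code from the repo, and the names (combine_unrolled, a, b, out, n) are made up: run the unrolled loop up to the largest multiple of the unroll factor, then finish the remaining elements with the original conditional.

// A sketch, not the repo code: run the unrolled loop over the largest
// multiple of the unroll factor, then mop up the last few elements.
#include <cstdint>
#include <cstddef>

void combine_unrolled(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    const size_t step = 2;                // unroll factor
    const size_t limit = n - (n % step);  // largest multiple of step that fits

    size_t i = 0;
    for (; i < limit; i += step)          // bulk of the work, no conditional
    {
        out[i] = a[i] * b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
    }
    for (; i < n; i++)                    // at most step - 1 leftover elements
    {
        out[i] = ((i % 2) == 0) ? a[i] * b[i] : a[i] + b[i];
    }
}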

Of course there is no rule about it being just two iterations.

rolling4.cpp has these lines:

int i;
for (i = 0; i < LONG_LENGTH - 4; i += 4)
{
    array4[i] = array1[i] * array2[i];
    array4[i + 1] = array1[i + 1] + array2[i + 1];
    array4[i + 2] = array1[i + 2] * array2[i + 2];
    array4[i + 3] = array1[i + 3] + array2[i + 3];
}
// tail loop for any leftover elements
for (; i < LONG_LENGTH; i++)
{
    if ((i % 2) == 0)
    {
        array4[i] = array1[i] * array2[i];
    }
    else
    {
        array4[i] = array1[i] + array2[i];
    }
}

But in this example the benefit of fewer loop-control operations is outweighed by the amount of real work already being done on each element, and we get a very similar result.

For completeness let's look at:

unrolling8.cpp

int i;
for (i = 0; i < LONG_LENGTH - 8; i += 8)
{
    array4[i] = array1[i] * array2[i];
    array4[i + 1] = array1[i + 1] + array2[i + 1];
    array4[i + 2] = array1[i + 2] * array2[i + 2];
    array4[i + 3] = array1[i + 3] + array2[i + 3];
    array4[i + 4] = array1[i + 4] * array2[i + 4];
    array4[i + 5] = array1[i + 5] + array2[i + 5];
    array4[i + 6] = array1[i + 6] * array2[i + 6];
    array4[i + 7] = array1[i + 7] + array2[i + 7];
}
// tail loop for any leftover elements
for (; i < LONG_LENGTH; i++)
{
    if ((i % 2) == 0)
    {
        array4[i] = array1[i] * array2[i];
    }
    else
    {
        array4[i] = array1[i] + array2[i];
    }
}

And it's still pretty much identical. That's one of the reasons I like these exercises: I form an expectation and, not infrequently, get a result I really don't expect.

Even rolling16.cpp holds up well. Are there clues in the assembler?

Don't worry if your knowledge of assembler is not up to scratch. It doesn't have to be.

First, let's look at the unrolled loop by getting gcc to output assembler:

gcc -S unrolling.cpp

This is without any optimisation.

.L13:
cmpl $99999998, -72(%rbp)
jg .L12
movl -72(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
movl (%rax), %ecx
movl -72(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movl (%rax), %eax
movl -72(%rbp), %edx
movslq %edx, %rdx
leaq 0(,%rdx,4), %rsi
movq -40(%rbp), %rdx
addq %rsi, %rdx
imull %ecx, %eax
movl %eax, (%rdx)
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
movl (%rax), %ecx
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movl (%rax), %edx
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rsi
movq -40(%rbp), %rax
addq %rsi, %rax
addl %ecx, %edx
movl %edx, (%rax)
addl $2, -72(%rbp)
jmp .L13

That looks complicated if you don't do assembler, but it's not hard to pick out the repeats of "addq $1, %rax" corresponding to [i+1], and it's plainly not exactly optimal.

So:

gcc -O -S unrolling.cpp

.L10:
movl 0(%rbp,%rax), %edx      # load element i from one input array
imull (%rbx,%rax), %edx      # multiply by element i of the other input array
movl %edx, (%r12,%rax)       # store the product into the result array
movl 4(%rbx,%rax), %edx      # load element i+1
addl 4(%rbp,%rax), %edx      # add the matching element i+1 from the other array
movl %edx, 4(%r12,%rax)      # store the sum into the result array
addq $8, %rax                # advance the byte offset by two 4-byte elements
cmpq $400000000, %rax        # 100000000 elements * 4 bytes each
jne .L10                     # loop until the whole array has been processed

 

Wow, that's efficient! Even our increment "i+=2" has become a +8, corresponding to the fact that we are moving forward through our arrays two 4-byte elements at a time. Our "overhead" for each pass of the unrolled loop is just the couple of instructions at the end that bump the index and compare it against the end of the array.

But even a little overhead is still overhead, right? Well, not really, and that has been true for more than two decades now. Modern processors perform out-of-order (dynamic) execution, carrying out non-conflicting instructions at the same time.

It's reasonable to assume, given that the benchmarks hold steady as I add more unrolling to the loop, that this is coming into play.
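
To illustrate why that helps here (again my own sketch, not code from the post, reusing the made-up names from the earlier sketch): the two statements inside the unrolled body read and write different elements and don't depend on each other, so the processor is free to overlap them; only the loop counter is shared.

// My own sketch: the two assignments are independent of one another,
// so an out-of-order core can work on both pairs of elements at once.
for (size_t i = 0; i + 1 < n; i += 2)
{
    out[i] = a[i] * b[i];             // touches only a[i], b[i], out[i]
    out[i + 1] = a[i + 1] + b[i + 1]; // touches only the i + 1 elements
}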

In conclusion, optimisation is always a balance between effort, complexity and diminishing returns. The initial 30% gain from just getting the conditional out of the loop would probably have been a good stopping point in the real world.

But in the context of educational exercises, there is often something to be learned from a deeper dive.


 



