Let's unroll a loop

Loop unrolling is an optimisation that has had a small place in my heart ever since I worked at a Dilbertesque company with so many management layers that, on one project, I was a couple of years in before hearing that some guy I had never heard of, and who had never bothered to meet the troops, was actually in charge!

But I digress. It was on that sort of project where promises had been made to a client about how fast a mathematical model was going to run. Obviously it didn't, and they thought they were going to need to lose a significant amount of dosh buying better kit for the client.

I profiled the code, found the hotspot in a loop and, having looked at the generated assembler, saw that no unrolling was being done. Duly unrolled, the code was, at least as far as speed goes, up to spec.

Some time later a man in a grey suit actually deigned to find me in my cubicle and awkwardly say thanks.  This was extremely novel at the time.

So here's an example which isn't too far off what I did.

rolling.cpp

github link

// gcc -O unrolling.cpp -lstdc++
#include <cstdio>
#include <stdint.h>
#include <cstring>
#include <cassert>
#include <cstdlib>
#include <time.h>
#define VERIFY
using namespace std;

#define LONG_LENGTH 100000000

static uint32_t getRandom(int low, int high)
{
    return rand() % (high + 1 - low) + low;
}

// array of random numbers
uint32_t *GetRandomArray(uint32_t length)
{
    uint32_t *array = (uint32_t *)malloc(length * sizeof(uint32_t));

    for (uint32_t i = 0; i < length; i++)
    {
        array[i] = getRandom(1, length);
    }

    return array;
}

int main(int argc, char **argv)
{
    clock_t start, end;
    double cpu_time_used1, cpu_time_used2;

    uint32_t *array1 = GetRandomArray(LONG_LENGTH);
    uint32_t *array2 = GetRandomArray(LONG_LENGTH);
    uint32_t *array3 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));
    uint32_t *array4 = (uint32_t *)malloc(LONG_LENGTH * sizeof(uint32_t));

    // plain loop: one conditional test per iteration
    start = clock();
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if ((i % 2) == 0)
        {
            array3[i] = array1[i] * array2[i];
        }
        else
        {
            array3[i] = array1[i] + array2[i];
        }
    }
    end = clock();
    cpu_time_used1 = ((double)(end - start)) / CLOCKS_PER_SEC;

    // repeat unrolled: two elements per pass, no conditional
    start = clock();

    int i;
    for (i = 0; i < LONG_LENGTH - 1; i += 2)
    {
        array4[i] = array1[i] * array2[i];

        array4[i + 1] = array1[i + 1] + array2[i + 1];
    }
    // tail loop for any leftover elements
    for (; i < LONG_LENGTH; i++)
    {
        if ((i % 2) == 0)
        {
            array4[i] = array1[i] * array2[i];
        }
        else
        {
            array4[i] = array1[i] + array2[i];
        }
    }

    end = clock();
    cpu_time_used2 = ((double)(end - start)) / CLOCKS_PER_SEC;
#ifdef VERIFY
    bool bad = false;
    for (int i = 0; i < LONG_LENGTH; i++)
    {
        if (array3[i] != array4[i])
        {
            bad = true;
            break;
        }
    }
    if (bad)
    {
        printf("Error arrays do not match\n");
        assert(false);
    }
#endif
    printf("loop time %f secs unrolled %f secs %f\n", cpu_time_used1, cpu_time_used2, cpu_time_used2 / cpu_time_used1);
    free(array1);
    free(array2);
    free(array3);
    free(array4);
}

A typical output is:

loop time 0.280279 secs unrolled 0.194381 secs 0.693527

That shows a healthy 30% speed-up, simply from taking a conditional expression out of the loop and performing two iterations at once. Of course, we do have to deal, a little messily, with the potential tail end of the array.
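
As an aside, here is a minimal sketch of one way to keep that tail handling a bit tidier. This is my own illustration, not code from the repo, and the names (combine_unrolled, a, b, out, n) are made up: run the unrolled loop up to the largest multiple of the unroll factor, then finish the remaining elements with the original conditional.

// A sketch, not the repo code: run the unrolled loop over the largest
// multiple of the unroll factor, then mop up the last few elements.
#include <cstdint>
#include <cstddef>

void combine_unrolled(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    const size_t step = 2;                // unroll factor
    const size_t limit = n - (n % step);  // largest multiple of step that fits

    size_t i = 0;
    for (; i < limit; i += step)          // bulk of the work, no conditional
    {
        out[i] = a[i] * b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
    }
    for (; i < n; i++)                    // at most step - 1 leftover elements
    {
        out[i] = ((i % 2) == 0) ? a[i] * b[i] : a[i] + b[i];
    }
}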

Of course there is no rule about it being just two iterations.

rolling4.cpp has these lines:

int i;
for (i = 0; i < LONG_LENGTH - 4; i += 4)
{
    array4[i] = array1[i] * array2[i];
    array4[i + 1] = array1[i + 1] + array2[i + 1];
    array4[i + 2] = array1[i + 2] * array2[i + 2];
    array4[i + 3] = array1[i + 3] + array2[i + 3];
}
// tail loop for any leftover elements
for (; i < LONG_LENGTH; i++)
{
    if ((i % 2) == 0)
    {
        array4[i] = array1[i] * array2[i];
    }
    else
    {
        array4[i] = array1[i] + array2[i];
    }
}

But in this example the benefit of fewer loop-control operations is outweighed by the amount of real work already being done on each element, and we get a very similar result.

For completeness let's look at:

unrolling8.cpp

int i;
for (i = 0; i < LONG_LENGTH - 8; i += 8)
{
    array4[i] = array1[i] * array2[i];
    array4[i + 1] = array1[i + 1] + array2[i + 1];
    array4[i + 2] = array1[i + 2] * array2[i + 2];
    array4[i + 3] = array1[i + 3] + array2[i + 3];
    array4[i + 4] = array1[i + 4] * array2[i + 4];
    array4[i + 5] = array1[i + 5] + array2[i + 5];
    array4[i + 6] = array1[i + 6] * array2[i + 6];
    array4[i + 7] = array1[i + 7] + array2[i + 7];
}
// tail loop for any leftover elements
for (; i < LONG_LENGTH; i++)
{
    if ((i % 2) == 0)
    {
        array4[i] = array1[i] * array2[i];
    }
    else
    {
        array4[i] = array1[i] + array2[i];
    }
}

And it's still pretty much identical. That's one of the reasons I like these exercises: I form an expectation and, not infrequently, get a result I really don't expect.

Even rolling16.cpp holds up well. Are there clues in the assembler?

Don't worry if your knowledge of assembler is not up to scratch. It doesn't have to be.

First, let's look at the unrolled loop by getting gcc to output assembler:

gcc -S unrolling.cpp

This is without any optimisation.

.L13:
cmpl $99999998, -72(%rbp)
jg .L12
movl -72(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
movl (%rax), %ecx
movl -72(%rbp), %eax
cltq
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movl (%rax), %eax
movl -72(%rbp), %edx
movslq %edx, %rdx
leaq 0(,%rdx,4), %rsi
movq -40(%rbp), %rdx
addq %rsi, %rdx
imull %ecx, %eax
movl %eax, (%rdx)
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
movl (%rax), %ecx
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movl (%rax), %edx
movl -72(%rbp), %eax
cltq
addq $1, %rax
leaq 0(,%rax,4), %rsi
movq -40(%rbp), %rax
addq %rsi, %rax
addl %ecx, %edx
movl %edx, (%rax)
addl $2, -72(%rbp)
jmp .L13

That looks complicated if you don't do assembler, but it's not hard to pick out the repeats of "addq $1, %rax" corresponding to [i+1], and it's plainly not exactly optimal.

So:

gcc -O -S unrolling.cpp

.L10:
movl 0(%rbp,%rax), %edx      # load element i from one input array
imull (%rbx,%rax), %edx      # multiply by element i of the other input array
movl %edx, (%r12,%rax)       # store the product into the result array
movl 4(%rbx,%rax), %edx      # load element i+1
addl 4(%rbp,%rax), %edx      # add the matching element i+1 from the other array
movl %edx, 4(%r12,%rax)      # store the sum into the result array
addq $8, %rax                # advance the byte offset by two 4-byte elements
cmpq $400000000, %rax        # 100000000 elements * 4 bytes each
jne .L10                     # loop until the whole array has been processed

 

Wow, that's efficient! Even our increment "i+=2" has become a +8, corresponding to the fact that we are moving forward through our arrays two 4-byte elements at a time. Our "overhead" for each pass of the unrolled loop is just the couple of instructions at the end that bump the index and compare it against the end of the array.

But even a little overhead is still overhead, right? Well, not really, and that has been true for more than two decades now. Modern processors perform out-of-order (dynamic) execution, carrying out non-conflicting instructions at the same time.

It's reasonable to assume, given that the benchmarks hold steady as I add more unrolling to the loop, that this is coming into play.
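
To illustrate why that helps here (again my own sketch, not code from the post, reusing the made-up names from the earlier sketch): the two statements inside the unrolled body read and write different elements and don't depend on each other, so the processor is free to overlap them; only the loop counter is shared.

// My own sketch: the two assignments are independent of one another,
// so an out-of-order core can work on both pairs of elements at once.
for (size_t i = 0; i + 1 < n; i += 2)
{
    out[i] = a[i] * b[i];             // touches only a[i], b[i], out[i]
    out[i + 1] = a[i + 1] + b[i + 1]; // touches only the i + 1 elements
}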

In conclusion, optimisation is always a balance between effort, complexity and diminishing returns. The initial 30% gain from just getting the conditional out of the loop would probably have been a good stopping point in the real world.

But in the context of educational exercises, there is often something to be learned from a deeper dive.


 



