Getting Started with SHARC+ Assembly Programming, part 2: Branch and Loop

2021-07-12

/uploads/blog/2021/1626292750849-DSC_0084.jpeg

Conditional Execution

This time we are going to take a look at how to implement the loop in SHARC+ assembly. We will be using the following simple code as an example:

// Calculate the Fibonacci number Fn, n need to be at least 2
int fibonacci(int n) {
    int a = 0, b = 1, f;
    for (int i = 1; i < n; i++) {
        f = a + b;
        a = b;
        b = f;
    }
    return f;
}

Normally, to implement it in assembly you would have code like this (RISC-V assembly for example):

    li t0, 0 # a
    li t1, 1 # b
    li t3, 1 # loop counter
loop_start:
    add t2, t0, t1
    mv t0, t1
    mv t1, t2
    addi t3, t3, 1
    blt t3, a0, loop_start

    mv a0, t2

A similar thing could be implemented in SHARC+ as well. Let's take a look at the code first (note the code is fairly inefficient, and we will improve that gradually):

    r2 = 1; // loop counter
    r8 = 0;
    r12 = 1;
loop_start:
    r0 = r8 + r12;
    r8 = r12;
    r12 = r0;
    r2 = r2 + 1;
    comp(r2, r4);
    if lt jump loop_start;

Aside from the syntax difference, one major difference is that SHARC+ uses flags to do the conditional jump, so a comparison instruction might be necessary before the condition. In this example, the loop counter is compared against the end condition. If the loop counter is less than (lt) the end condition, it jumps back to the beginning of the loop. If it is counting down, then we could use flags that directly compare the last result with zero. There are other conditional codes (and there complementary) available:

And there are few more special condition codes for loop, that would be covered later.

Running the example

Now we could try to run these codes on the board by reusing the framework we had in the part I.

Create a new file called fibonacci_test.c, use the provided C version as the reference code. Similarly we generate 256 random numbers, and calculate the fibonacci sequence of random number. Finally, it compares the result between reference code and assembly code.

void fibonacci_test() {
    int *src;
    int *dst_ref;
    int *dst;

    src = malloc(sizeof(int) * BLOCKSIZE);
    assert(src);
    dst_ref = malloc(sizeof(int) * BLOCKSIZE);
    assert(dst_ref);
    dst = malloc(sizeof(int) * BLOCKSIZE);
    assert(dst);

    random_int(src, 2, 100, BLOCKSIZE);

    clock_t cycles_ref = clock();
    for (int i = 0; i < BLOCKSIZE; i++) {
        dst_ref[i] = fibonacci_ref(src[i]);
    }
    cycles_ref = clock() - cycles_ref;

    clock_t cycles = clock();
    for (int i = 0; i < BLOCKSIZE; i++) {
        dst[i] = fibonacci(src[i]);
    }
    cycles = clock() - cycles;

    printf("Finished in %d cycles (Ref %d cycles, %.2fX)\n", cycles, cycles_ref,
            (float)cycles_ref/(float)cycles);
    for (int i = 0; i < BLOCKSIZE; i++) {
        if (dst_ref[i] != dst[i]) {
            printf("Test failed! First 4 numbers in the result:\n");
            for (int i = 0; i < 4; i++) {
                printf("Result: %d, Ref: %d\n", dst[i], dst_ref[i]);
            }
            break;
        }
    }

    free(src);
    free(dst_ref);
    free(dst);
}

Create another new file called fibonacci.asm, containing the assembly version provided above. Remember to include the function definition, segment definition, etc. similar from the last time.

Run the code, the output should be something similar to this

Finished in 115343 cycles (Ref 32515 cycles, 0.28X)

Note the code is much slower than the reference code. This is expected, and we are going to improve that.

Delay Slot

Remember delay slots from the MIPS? They are also present in the SHARC. It is optional, by using the “db” (delayed branch) flag inside the parentheses after the label. For example:

if cond jump label (db); instruction 1; Instruction 2;

In this example, two instructions after the jump are in the delay slot, and they will be executed before the jump. Note this would happen regardless of the condition: even if the jump doesn’t happen, these two instructions are still executed.

What happens if the programmer decides not to use the (db)? In that case, the pipeline just stalls for 2 cycles, as if there are 2 nops in the delay slot. This could make sense if the code is not in the critical region and speed is not important.

By using the delay slot, the previous code example could be optimized as follows:

    r2 = 1; // loop counter
    r8 = 0;
    r12 = 1;
loop_start:
    r0 = r8 + r12;
    r2 = r2 + 1;
    comp(r2, r4);
    if lt jump loop_start (db);
        r8 = r12;
        r12 = r0;

The previous code takes 8 cycles per iteration (6 instructions + 2 stalls), the new code only takes 6 cycles per iteration.

We could also try running the code in the example, the result should be something like this:

Finished in 91110 cycles (Ref 32515 cycles, 0.36X)

Loop

The previous code does the loop just fine. But in tight loops like FIR filtering, or the example given, the overhead of loop counting and branching can get significant. Like in the FIR filtering, the actual calculation could be done in 1 cycle per iteration, but the loop counting and branching take at least 2 cycles. The processor ended up using only 33% of its time doing real calculations. One way to reduce the overhead is to aggressively unrolling the loop. Another way is to introduce zero-overhead loop support at ISA level. This means the loop counting and branching are done with special hardware units. This is quite common among DSP architectures, including SHARC. The syntax is as follows:

lcntr=r8, do label until lce

The "lcntr" refers to the hardware loop counter, "label" is the label to the last instruction in the loop, and "lce" (loop counter expired) is one of the special conditional code for loop:

Generally, loop instruction is only used to implement a counter-based loop, such as a typical for-loop. But non-counter-based loops are supported as well, by using condition code other than "lce" after "until", for example:

do label until eq

By implementing the previous loop using this loop instruction, the loop counting and jumping code could be removed. The resulting code is shorter:

    r8 = 0;
    r12 = 1;
    r4 = r4 - 1;
lcntr = r4, do loop_end until lce;
    r0 = r8 + r12;
    r8 = r12;
loop_end:
    r12 = r0;

Now it only takes 3 cycles per iteration compared to 6 before. Now the performance is around 71% of the compiled code:

Finished in 45743 cycles (Ref 32515 cycles, 0.71X)

The code could be further optimized by leveraging the VLIW multi-issue capability:

    r8 = 0;
    r0 = 1;
    r4 = r4 - 1;
lcntr = r4, do loop_end until lce;
loop_end:
    r0 = r8 + r0, r8 = r0;

The addition and copy are done in the same cycle, so there is no need for an additional register to hold the old value. The whole loop is now only 1 cycle per iteration. I will talk more about the multi-issue in the future. In this case, the code is even 72% faster than the compiled code.

Finished in 18903 cycles (Ref 32515 cycles, 1.72X)

Tip: Ever wondering what compiler is doing? To view the code generated by the compiler, go to the project properties, in CrossCore SHARC C/C++ Compiler -> General, check Save temporary files. Rebuild the project, the compiler generated code is located at Debug/src/fibonacci_test.s. We could take a look here:

    lcntr=r0, do (pc,.P33L2_end) until lce;
    r0=r2+r1, r2=r1;
.P33L2_end:
    r1=pass r0

The compiler realized the addtion could be issued together with the register copy, but didn't realize if the copy could be done in one cycle, there is no need for additional register.

Back to the discussion about loops. Loop instructions could be nested as well:

lcntr = r, do outer_loop_end until lce;
    lcntr = r, do inner_loop_end until lce;
    inner_loop_end:
        instr;
outer_loop_end:
    instr;

Branches (and function calls) are allowed in the loop as well:

lcntr = r, do loop_end until lce;
    if cond jump skip;
    call func;
skip:
    instr;
loop_end:
    instr;

Aborting the loop by using jump ("break" in C) is also allowed. Just use the "la" (loop abort) flag in the jump:

jump out_of_loop (la);

The "la" and "db" could be used together as well.

Note that generally if you have codes that jump out of the loop, the compiler tends to avoid using loop instructions at all.

Conclusion

This time we talked about branches and loops. We also gradually optimized one function by using the capability provided by the processor, and finally beat the compiler. Next time we will talk about memory access. Until then, thanks for reading.