Getting Started with SHARC+ Assembly Programming, part 3: Memory Access

2021-08-12

This time I am going to cover memory access. A lot of DSP algorithms are quite heavy on memory access, so how memory accesses are handled has a big impact on performance. As a result, the memory access model on SHARC is not as simple as what is usually found on RISC processors. Note: modern RISC processors have much more sophisticated memory systems than SHARC+, but they are usually hidden from the programmer.

Data Access

Basic Address Calculation

In SHARC, memory addresses are calculated by the DAG (Data Address Generator). When accessing memory, the DAG uses 2 registers to calculate the address: one index (I) register and one modify (M) register. The address is always index + modify. For example, to access memory address 0xc0000000, I could load 0xc0000000 into the index register, then use the constant 0 for the modify. The code would be as follows:

i4 = 0xc0000000;
m4 = 0;
r4 = dm(m4, i4);

DM refers to data memory; SHARC uses the syntax dm(modify reg, index reg) to access memory. To store to that address, just put dm on the LHS:

dm(m4, i4) = r4;

In SHARC, some modify registers hold constant values by convention (the ADI runtime initializes them this way and code is expected to preserve them): m5 is 0, m6 is 1, and m7 is -1. So the code could also be:

dm(m5, i4) = r4;

An immediate value is also allowed as the modify. Its width is limited by the instruction encoding, which could be 6-bit, 24-bit, or 32-bit. The index must always come from a register.

dm(0, i4) = r4;

Say I have two consecutive 32-bit values in memory (for example, a complex number, or a 2D coordinate), pointed to by i4, and now I want to load the first value (real part) into r4 and the second value (imaginary part) into r8. One way to achieve this is:

r4 = dm(m5, i4);
r8 = dm(m6, i4);

The first access is modified by 0, and the second one by 1. The modify value is counted in words, so 4 bytes by default.

I said one way, as there is another way. In the previous example, the modify is applied to the index first, and the result is used to access memory. This is called pre-index mode. There is also a post-index mode, where the index value is used to access memory as-is, then the modify is added to the index and the result is written back to the index register. In pre-index mode, the index register is not updated; in post-index mode, it is updated with the new address. So the previous code could also be written as follows:

r4 = dm(i4, m6);
r8 = dm(i4, m6);

Pre-indexing is usually useful for accessing fields in a struct, while post-indexing is usually useful for accessing elements of an array within a loop.
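
Here is a minimal sketch of both patterns (the field offset and the loop count N are hypothetical):

// pre-index: read the field at word offset 1 of a struct pointed to by i4
m4 = 1;
r4 = dm(m4, i4);     // load the second field; i4 still points to the struct base

// post-index: sum an N-word array pointed to by i4 (m6 is 1 by convention)
r8 = 0;
lcntr = N, do (pc, 2) until lce;
    r4 = dm(i4, m6); // load the element, then i4 advances by one word
    r8 = r8 + r4;    // accumulate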

Access Length

In the previous examples, all accesses were 32-bit. Additionally, SHARC+ supports 8-bit (byte word, "bw"), 16-bit (short word, "sw"), and 64-bit (long word, "lw") accesses. The length is denoted with a suffix in parentheses after the instruction:

r4 = dm(i4, m5) (bw);
r4 = dm(i4, m5) (lw);

Keep in mind that the meaning of the modify also changes. For example, in a 32-bit access, a modify of 1 increments the address by 4; in an 8-bit access, it increments the address by only 1. The address needs to be aligned to the word size; for example, a 64-bit access expects 64-bit alignment.
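
To illustrate the scaling (a sketch; m6 is 1 by convention, and a long word access transfers a neighboring register pair):

r4 = dm(i4, m6) (bw); // read one byte, i4 advances by 1
r4 = dm(i4, m6);      // read 32 bits, i4 advances by 4
r4 = dm(i4, m6) (lw); // read 64 bits into the r4:r5 pair, i4 advances by 8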

Additionally, there is a modify instruction, that only does the address calculation, but not accessing the memory. This could be useful when accessing a 2D or 3D array in a nested loop:
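
A minimal sketch (matrix and COLS are hypothetical; assume a row-major 32-bit matrix):

i4 = matrix;     // i4 -> matrix[0][0]
m4 = COLS;       // row stride in words
r4 = dm(m5, i4); // load matrix[0][0]; i4 unchanged
modify(i4, m4);  // i4 -> matrix[1][0]; address math only, no memory access
r4 = dm(m5, i4); // load matrix[1][0]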

Note: on x86 there is a similar instruction called lea (load effective address), which nowadays is commonly used for general arithmetic as well. On SHARC, modify is usually only used for address calculation due to register limitations.

Circular Buffer

The SHARC+ address generator also supports a circular buffer mode. In this mode, the address is confined to a fixed range; incrementing or decrementing the address out of bounds automatically wraps it around.

This is done with 2 additional registers: the base (B) register, which holds the start address of the buffer, and the length (L) register, which holds its length in words. Together they define the buffer's range.
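
A minimal sketch (buf is a hypothetical 8-word array; note that writing a B register also loads the same value into the corresponding I register):

b4 = buf;        // also sets i4 = buf
l4 = 8;          // buffer length in words; l4 = 0 disables wrapping for i4
r4 = dm(i4, m6); // after 8 post-increments, i4 wraps back to buf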

Dual bus access

As we know, on Harvard architecture machines, there are 2 buses: one for instructions and one for data. This is true for SHARC as well: the instruction bus is called the PM (program memory) bus, and the data bus is called the DM (data memory) bus. In DSP applications, the core quite often needs to access large constant tables during calculation. It would be great if the processor could access 2 different addresses at the same time (in one cycle). SHARC allows this by letting the program access data through the instruction (PM) bus. So the programmer can put a constant table in the code memory, then access the table over the PM bus. Meanwhile, instructions are supplied by a small cache (called the conflict cache) in the core.
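
For example, a sketch of placing a coefficient table in PM memory (seg_pmda is the PM data section name in ADI's default linker files; your project may use a different one):

.section/pm seg_pmda;
.var coeffs[4] = 0.5, 0.25, 0.125, 0.0625;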

The core has a second set of index and modify registers for PM bus data access: i8-i15 and m8-m15. To use the PM bus for data access, the programmer needs to use these registers for indexing. For example,

r4 = pm(i12, m12);

Of course, a PM access and a DM access can be issued together in one cycle; this is the whole reason the feature exists:

r4 = dm(i4, m4), r8 = pm(i12, m12);

We will talk more about multi-issue capabilities in the future. All SRAM blocks in SHARC are single-port RAM, so each block can only serve one access per cycle. To finish this instruction in one cycle, the two addresses need to be backed by different physical RAM blocks. Otherwise, the core stalls to fetch the data, providing no performance improvement.

Memory hierarchy

One major reason for writing assembly code by hand is the hope that it will be faster than compiled C code. Understanding the memory hierarchy and caches of the SHARC+ is quite important for optimal performance.

There are 3 layers of memory: on-chip fast RAM (L1 SRAM), on-chip slow RAM (L2 SRAM), and off-chip RAM (L3 RAM). This is similar to the TCM/OCRAM/SDRAM hierarchy commonly found on modern ARM and RISC-V microcontrollers.

L1 memory is the fastest, with single-cycle access. On SHARC+ devices, L1 SRAM is divided into 4 banks behind a crossbar connected to the core's PM and DM buses. Usually, 2 banks are used for DM data, one bank for PM data, and one bank for PM instructions. Because it's behind a crossbar, nothing stops you from using the DM bus to access everything; but for the reason stated earlier, you want different buses to access different memory banks for optimal performance.

L2 memory is slower, taking roughly 10 cycles to access. It sits on the system crossbar, so it is shared across all cores. It is usually used for inter-core communication, but when data doesn't fit in L1, it is also commonly used as data memory.

L3 memory is the slowest. Many applications don't use it at all, or only put speed-insensitive code and data there. In the rare cases where even L2 is not enough, data goes into L3.

Caches

Because things don't always fit in L1, caches are critical for performance. One interesting thing about the cache on SHARC+ is that there is only a cache controller, but no dedicated cache memory: to use the cache, part of the L1 RAM is allocated as cache memory. This gives the user the flexibility to decide how to use the on-chip memory. The size is configurable as well, but due to the memory speed, the cache is limited to being direct-mapped.

There are 3 caches: PM code cache, PM data cache, and DM data cache. They are allocated in different banks in L1.

There is also a small (32-instruction) conflict cache in the core, which supplies instructions to the core only when the PM bus is occupied by a dual bus memory access.

Now let's talk about the performance implications of using caches. Most DSP algorithms are quite cache-friendly, and running from L2 + cache is not much slower than running from L1. But there are exceptions. Note: if you can't follow this part now, that is fine. Come back later when you have more experience writing assembly code.

Because the cache is direct-mapped, if several buffers used in one function are spaced apart by exactly the cache size, they map to the same cache lines; accessing them within a loop keeps thrashing the cache, leading to extremely inefficient code.
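
A sketch of the pathological case (buf_a, buf_b, and N are hypothetical; the buffers sit exactly one cache size apart, so they map to the same cache lines):

i4 = buf_a;
i5 = buf_b;          // buf_b = buf_a + cache size
lcntr = N, do (pc, 2) until lce;
    r4 = dm(i4, m6); // this load evicts the line the next one needs...
    r8 = dm(i5, m6); // ...and vice versa, on every iteration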

It also gets interesting when you are using dual bus access and cache together.

Generally, if you place your data in different SRAM blocks, running from L1 gives the best performance. But what if you don't? Say the PM data and DM data end up in the same block; similarly, there will be performance issues if data and code share a block. In that case, putting things in L2 could actually be better: data will be fetched into the correct cache automatically, so dual bus access doesn't cause extra stalls when running from the cache. But generally, the data placement should be correct in the first place.

But putting things in L2 could also completely ruin the performance. Say you have both a coefficient table and a data buffer in L2, and the coefficients are calculated on the fly instead of being fixed at compile time. Because the calculation only runs once, its performance isn't that important, and the programmer uses the DM bus to write the table. Then, during execution, the code uses the PM bus to read the table. Because the table was filled over the DM bus, at least part of it now lives in the DM data cache. Later, in the actual calculation, the code accesses the data over the PM bus. The core first queries the PM data cache, which results in a cache miss. To keep the caches coherent, the core then checks whether the data is in the DM data cache; this is called a cross-check in SHARC. The cross-check could be a hit, in which case the data stays in the DM cache. In the worst case, the data is always in the DM cache and never brought into the PM cache, so every access is a cross-check hit, which has a 2-cycle delay. In C, the compiler ensures buffers are used consistently, but in assembly, the programmer needs to be careful about these issues.
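
A sketch of this pattern (table is a hypothetical symbol placed in L2):

// setup, runs once: fill the table over the DM bus
i4 = table;
m4 = 1;
dm(i4, m4) = r4;   // these lines now live in the DM data cache

// hot loop: read the same table over the PM bus
i12 = table;
m12 = 1;
r8 = pm(i12, m12); // PM miss + DM cross-check hit: 2 extra cycles each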

Conclusion

In this part, I talked a lot about memory access, the dual bus architecture, and the performance implications of the caches. Now you have everything needed to implement most DSP algorithms. Next time I will talk about the SIMD (vector) and VLIW (multi-issue) processing capabilities that enable faster code.