As I've mentioned a number of times during this series, eliminating the central-memory bottleneck is one of the performance benefits of building custom hardware versus a CPU approach. That does not mean that FPGA designs never include memory; they are often used together with an external memory. However, the topic of this installment is another kind of memory that is key to performance: embedded memory. Let's start with some example code and then talk more about embedded memory.
If you are new to this series, you may want to go back to Hardware-as-Code Part I.
Software update

One last thing before we get into it: you will need to update the Upduino HLS toolchain to the latest version before building the examples below. To do the update, simply enter the following pio command (if you're using Visual Studio Code, open a command prompt by clicking the PlatformIO icon (ant) on the left and then selecting PlatformIO Core CLI under Miscellaneous):
> pio platform update upduino_hls
This should install the latest release (0.2.1 at the time of writing).
The simple perceptron

Last time, I said that the example in part III involves the same type of computation as neural networks. The following example implements a single-layer neural network, called a perceptron, and as you'll see the code is very similar.
As always, this code is also available from the git repository (https://github.com/sathibault/hac-examples).
This example performs classification just like the poly-classify example of part III. However, this time we are classifying flowers from the popular iris dataset. There are four data values associated with each flower: sepal length, sepal width, petal length, and petal width. In this case, we are classifying the flower type (setosa or not).
When you run the example, you should get the following output:
Ex 0: -822
Ex 1: -837
Ex 2: 94
Ex 3: 777
This shows the output of the perceptron for four example flowers. Output values less than 0 are predicted to be type setosa, and values over 0 not setosa. The actual flower types of the four examples are 1) setosa, 2) setosa, 3) versicolour, and 4) virginica. So, the actual types line up exactly with the perceptron's predictions above.
Arrays and embedded memory

The main difference between the example from last time and the dotproduct function here is the use of arrays. The Upduino HLS tool supports arrays like we've used here, but you are probably wondering about the in_array<int16_t,4> type on line 5 and the as_array function on line 27. Sometimes, perfectly normal C/C++ written for CPUs simply does not provide enough information to generate equivalent hardware. That is the case for arrays. Arrays are often passed around C/C++ programs as a simple pointer to some space in central memory where the array is located.
In the case of hardware synthesis, arrays actually get mapped to individual, dedicated memories embedded in the hardware. Yes, you read that correctly! The FPGA contains many small memory blocks called embedded memory or block RAM. Each array variable will have dedicated memory blocks that hold that variable's data.
In order to generate a memory for an array variable, we need to know what size it is, and that's the point of the in_array<int16_t,4> type. This type indicates that the parameter is an input array, the element type is int16_t, and the maximum number of elements in the array is 4. That enables the hardware generator to allocate the correct number of memory blocks to receive the data for the features parameter.
On line 27, the as_array function is used to construct the correct type to match the in_array parameter, but also to indicate how much data is actually in the array (as opposed to the maximum, which we know from the parameter type). Since this call executes on the CPU and the function executes on the FPGA, the data must be sent from the CPU to the FPGA. Consequently, we need to know exactly how much data is to be sent.
This is one of the cases where we have to follow certain design patterns in order to target hardware generation. Unfortunately, the exact mechanism may vary from one toolchain to the next. The good news is that there are not a lot of cases like this, and the principle is always the same, even if the syntax varies. For arrays, the principle is that we need to provide some size information, and we do that using certain types or functions provided by the framework we are using.
A multi-class perceptron

The iris dataset actually contains flowers of three different types: setosa, versicolour, and virginica. The perceptron model can be extended to multiple classes by simply using one perceptron per class; the one with the highest output gives the predicted class. The following example implements a multi-class perceptron to identify the three different iris flower types.
The output should look like:
>.\program
Ex 0 = 0 (3694, -3018, -5298)
Ex 1 = 0 (2474, -1038, -6052)
Ex 2 = 1 (-944, 1292, 204)
Ex 3 = 2 (-2392, 138, 5364)
As you can see, the maximum value of each row matches the expected class for that example.
This example introduces another array parameter type, out_array (line 5), and a corresponding resize method (line 12). This type serves the same purpose as in_array, but represents an output array parameter. So again, it indicates to the compiler how much space to reserve for this array, i.e. the maximum number of elements it will have.

In addition to the maximum array size, since the out data needs to be returned to the CPU from the FPGA, the system must know how much data is actually in the array. The resize method is used to indicate how many elements are actually used and need to be sent back to the caller. Neither of these examples uses an array for both input and output, but that is also available using the inout_array template type in the same manner.
Another thing you might be wondering is why coef isn't a two-dimensional array. That's because the Upduino HLS toolchain only supports one-dimensional arrays. Most HLS toolchains do support multi-dimensional arrays, but I don't find this limitation particularly annoying. The index calculation is easy; it's just row-index * row-length + column-index. Having to write it out helps you be aware of the arithmetic resources required, and most of the time it can be done more efficiently if you think about it (hint: using an index variable and just 2 additions, no multiplications).
There are two main points I want to make about these examples. The first is how embedded memory breaks the memory bottleneck. Since each array has its own memory, they are completely independent and can be accessed simultaneously! Furthermore, the embedded memories of many FPGAs have two ports, which means you can actually access two values per cycle per array. That's a lot of parallelism. So what's the catch? Well again, we are limited by space.
There are only a fixed number of memory blocks per chip. Memory blocks are described by the number of bits they can store. The FPGA on the UPduino board has 30 block RAMs, each with 4096 bits of storage. To calculate the number of blocks an array will need, multiply the number of elements by the number of bits per element, divide by the block RAM size, and round up.
On the UPduino, the total number of bits is 30 x 4096 = 120Kb = 15KB. That's really limited, but this FPGA is one of the smallest available. Large FPGAs can have many Mb of block RAM. Side note: most hardware documentation will give memory numbers in bits, written with a lowercase b (e.g. Kb), as opposed to bytes, usually written with a capital B (e.g. KB).
The second point is that loops help us make trade-offs between space and speed. In the last installment, everything was done in parallel and took up multiple adders and multipliers. In this installment we've used loops, which take more execution time, but only use one multiplier and one adder. It's not all or nothing; an intermediate option is to partially unroll loops to balance the speed vs space trade-off.
Here's a challenge for you. The multi-class perceptron example above requires 3 x 4 iterations which are executed sequentially (it is possible to execute iterations in parallel but that is an advanced topic that will not be covered until later). How could you rewrite this example to take advantage of the embedded memory's parallelism and compute the 3 outputs in parallel?
Next steps

Up to this point, we've only talked about space at a pretty high level. Next time, I'll cover in more depth what's inside the FPGA, how space is measured, how much is available, and how to find out exactly how much a particular function is using.
Continue to Part V: Inside the FPGA
Connect

Please follow me to stay up-to-date as I release new installments. There is also a Discord server (public chat platform) for any comments, questions, or discussion you might have at https://discord.gg/3sA7FHayGH