I've taken a quick-start approach to this tutorial, jumping right into writing code early. It's not necessary to understand too much about an FPGA to write code, just like you write software for a CPU without really knowing details about CPUs. However, the FPGA is a resource-constrained device and at some point you will want to understand what those resources are and how they are measured. Today we will take a closer look at what makes up an FPGA so you can do just that.
If you are new to this series, you may want to go back to Hardware-as-Code Part I.
The FPGA elements

Let's begin by introducing the main elements found inside the FPGA.
Programmable Logic - This is the main and most plentiful resource available in the FPGA. As I described in part I, digital logic is made of logic gates (and, or, not, etc.), and that logic can be used to implement arithmetic operations and, well, everything else a computer does. The programmable logic elements are often referred to as LEs (logic elements) or LUTs (lookup tables). LE capacities range from several hundred in the smallest devices to over a million in the largest FPGAs.
Registers - The values of (non-array) variables are stored in registers, which are made up of hardware elements called flip flops. Each flip flop (or just FF) stores a single bit, so an 8-bit variable would use 8 FFs in hardware. There are often one or two FFs paired with every LE, so the flip flop capacity is usually of the same order as the LE count.
DSP blocks - The so-called "digital-signal processing" blocks primarily consist of a hard-coded (i.e. fixed function, not programmable) multiplier and an adder. There are several variations, but the multiplier + adder is the key functionality common to all of them. Although both can be implemented purely in programmable logic, the DSP blocks are much more efficient, and they are very important to many applications. DSP block counts range from under 10 to over 1,000.
Synthesis tools will fall back to programmable logic for multipliers and adders if there are not enough DSP blocks available. This is quite common for adders, which remain reasonably efficient in programmable logic. Multipliers, on the other hand, are very wasteful when built from programmable logic, and you may quickly run out of space if you need many more multipliers than the number of available DSP blocks.
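To make the DSP mapping concrete, here is a minimal sketch of the multiply-accumulate (MAC) pattern in plain C++ (the function name and standalone form are my own, for illustration). When code like this is synthesized, the multiply and the following add can map directly onto a single DSP block when one is available; otherwise the tool builds them from programmable logic.

```cpp
#include <cstdint>

// Multiply-accumulate: the core operation a DSP block implements in
// fixed-function hardware. A synthesis tool can map the 16x16 multiply
// and the add onto one DSP block, or build both from LUTs if none remain.
int32_t mac(int32_t acc, int16_t a, int16_t b) {
    return acc + a * b;
}
```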
Embedded RAM - We covered the embedded RAM in detail in part IV. These elements are often referred to as EBR (embedded block RAM), Block RAM, or BRAM. The capacity of each embedded RAM is usually a few thousand bits, and the number of blocks may range from tens to thousands. Some FPGAs also include multiple block sizes in the same device, for example many small RAMs and a few large ones.
I/O - These important elements provide input and output using the external pins of the chip. We won't worry too much about these in this series because we will always be using pre-designed blocks built into the HLS tool that connect our design to the outside world. However, just briefly for completeness... The I/O elements include general-purpose I/O (or GPIO) that can either drive a pin with a high/low output value or read a high/low input value. Beyond GPIOs, FPGAs often provide fixed-function I/O elements that communicate with external devices using specific protocols like DDR, SPI, I2C, PCIe, etc.
FPGA specifications

Pretty much all FPGA vendors offer different series of FPGAs, called device families, that target different classes of application. Each family includes several parts at different sizes (both physically and in terms of available resources), packages, and special features. When you visit the website for a device family, there will be a family table on the page (or attached) that summarizes the resources available for each device in the family.
This is the family table for the FPGA on the UPDuino board we are using:
Hopefully, you can now identify the 4 main resources we are interested in (well actually 3, because one is missing). The UPDuino contains an UP5K device (second column), so our board has the following resources:
- Programmable logic: 5,280 elements (LUTs in the table)
- Registers: not explicitly given, but assume similar to LUT count (The datasheet will have more detail.)
- Embedded RAM: 120Kb (This is the total bit capacity. The datasheet would give details on the number and size of the individual blocks.)
- DSP blocks: 8 (multiply & accumulator blocks in the table)
As a bonus, these chips have some larger block RAMs (SPRAM in the table).
It's usually worth taking a look at the first few pages of the datasheet as well, even if you don't understand most of the rest of the document. After the overview, there is usually a more detailed family table than what you find on the website.
Looking through the datasheet for the UP5K device (download datasheet), we find it has 30 embedded RAM with 4096 bits each (30 * 4Kb = 120Kb), and 4 large RAM blocks of 256Kb (4 * 256Kb = 1Mb).
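The totals are easy to verify with a little arithmetic (here Kb means 1,024 bits, as is conventional in these datasheets):

```cpp
// Sanity-check the UP5K memory arithmetic from the datasheet.
constexpr long ebr_bits   = 30L * 4096;       // 30 EBR blocks x 4096 bits
constexpr long spram_bits = 4L * 256 * 1024;  // 4 SPRAM blocks x 256 Kb
static_assert(ebr_bits   == 120L * 1024, "EBR total is 120 Kb");
static_assert(spram_bits == 1024L * 1024, "SPRAM total is 1 Mb");
```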
Routing (the hidden resource)

After a design has been mapped into digital logic (and DSP blocks, etc.), the tools must lay out those elements on the chip. It's geometrically impossible to lay out the design so that everything is directly adjacent to the things it is connected to. Consequently, there are "routing" resources, which are basically wires and programmable switches that allow different parts of the chip to be connected.
Unfortunately, this is a resource that is difficult to describe quantitatively. So, you won't see any numbers in the FPGA specifications describing the available routing (the datasheet will probably have a description of the routing architecture, however), and there will be little or no output from the tools about how much routing a design is utilizing. At best, you might see a warning like "this design has high congestion."
In practice, routing is not something you will worry about directly, but it will impact the maximum achievable chip capacity. The more logic and other elements left unused, the more flexibility there is in layout, and the more likely that the tools are going to find a viable layout. Said the other way around, the higher the percentage of resources you use, the less flexibility there is in layout, and the more likely the tool is to fail to find a viable layout.
I find that once you reach 70-75% utilization of the FPGA, the tools may start to struggle during layout. Once you reach 80% or more, you may run into complete failures. This is just my experience, YMMV. At this point, the easy solution is to move up to a larger FPGA. There are things that can be done to push the limits of device utilization, but those are beyond the level of this series, and will require assistance from a digital hardware designer.
Resource usage reports

Now that you understand the most important resources, we can look at the tool reports that tell us how many of them our design is using. When you click the upload button to build and upload your design to the FPGA, you will see tons of information fly by, including the resource usage information. You can also run these tools without actually uploading to the UPDuino board with the command: pio run --target bitstream
(you may need to clean the project first, if the bitstream was previously built).
When the tool finishes, scroll backwards several pages until you get to a table that looks like this:
Info: Device utilisation:
Info: ICESTORM_LC: 544/ 5280 10%
Info: ICESTORM_RAM: 0/ 30 0%
Info: SB_IO: 6/ 96 6%
Info: SB_GB: 4/ 8 50%
Info: ICESTORM_PLL: 0/ 1 0%
Info: SB_WARMBOOT: 0/ 1 0%
Info: ICESTORM_DSP: 1/ 8 12%
Info: ICESTORM_HFOSC: 1/ 1 100%
Info: ICESTORM_LFOSC: 0/ 1 0%
Info: SB_I2C: 0/ 2 0%
Info: SB_SPI: 0/ 2 0%
Info: IO_I3C: 0/ 2 0%
Info: SB_LEDDA_IP: 0/ 1 0%
Info: SB_RGBA_DRV: 0/ 1 0%
Info: ICESTORM_SPRAM: 0/ 4 0%
This is the output from the layout tool. The first line uses yet another term for logic elements: LC (logic cell), i.e. LUT + FF. After that is the embedded RAM, and further down, the DSPs.
This particular output is from our very first example in Part II: Hello FPGA. The design uses only 10% of the LE resources, so there is plenty of space to do more with it. In fact, much of that 10% is overhead generated by the tools to connect your function to USB so we can invoke it from the computer.
To experiment, take any of the examples we've covered and try making various changes to the code to see how this usage-report table changes. 5K LUTs is not a lot, but I've used this device to build a number of sophisticated applications, such as wake-word detection and image classification.
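For example, one change worth trying (sketched here with plain C arrays; the actual examples use the HLS tool's in_array/out_array wrappers) is widening a computation so it needs more multipliers, then watching the ICESTORM_DSP line change:

```cpp
#include <cstdint>

// An 8-term dot product. If the tool unrolls this loop, each multiply
// can occupy its own DSP block, so lengthening or shortening the loop
// is an easy way to see the DSP usage move in the report.
int32_t dot8(const int16_t a[8], const int16_t b[8]) {
    int32_t acc = 0;
    for (int j = 0; j < 8; j++) {
        acc += a[j] * b[j];
    }
    return acc;
}
```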
Perceptron challenge

In Part IV, I challenged you to modify the multi-class perceptron example to compute the 3 outputs in parallel. Here's the solution to that problem:
// Multi-class perceptron is a matrix/vector multiplication
void mat_vec_mul(in_array<int16_t,4> features, out_array<int16_t,4> out) {
    int16_t coef0[4] = { -9, 25, -28, -27 };
    int16_t coef1[4] = { -3, -31, 58, -31 };
    int16_t coef2[4] = { -10, 4, 80, 79 };
    int16_t acc0 = 0;
    int16_t acc1 = 0;
    int16_t acc2 = 0;
    for (uint8_t j = 0; j < 4; j++) {
        int16_t x = features[j];
        acc0 += coef0[j] * x;
        acc1 += coef1[j] * x;
        acc2 += coef2[j] * x;
    }
    out.resize(3); // set the actual output size
    out[0] = acc0;
    out[1] = acc1;
    out[2] = acc2;
}
The key to the solution is to take advantage of the fact that each embedded RAM, and thus each array variable, can be accessed in parallel. By splitting the coef variable into three separate variables and unrolling the outer loop, we enable the multiply-and-accumulate step for each output to occur in parallel. If we had unrolled the loop but not split up coef, the toolchain would still have generated separate hardware for the three multiply-and-add statements, but they would need to take turns reading values from the coef array and would not actually execute in parallel. As written, the loop above uses 2 cycles per iteration: one to read the arrays and one to perform the multiply and accumulate. In a few weeks, we will learn how this can be further improved to 1 cycle per iteration.
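For contrast, here is a sketch of the unsplit version using a single 2-D coef array (shown with plain C arrays rather than the tool's in_array/out_array wrappers, and with a hypothetical function name). It computes the same outputs, but in hardware the single coef array would live in one embedded RAM, forcing the three accumulators to take turns reading coefficients:

```cpp
#include <cstdint>

// Same multi-class perceptron, but with one shared coefficient array.
// In hardware, all three rows must be read from the single coef RAM,
// so the three multiply-accumulates cannot all happen in the same cycle.
void mat_vec_mul_shared(const int16_t features[4], int16_t out[3]) {
    static const int16_t coef[3][4] = {
        {  -9,  25, -28, -27 },
        {  -3, -31,  58, -31 },
        { -10,   4,  80,  79 },
    };
    for (int i = 0; i < 3; i++) {
        int16_t acc = 0;
        for (int j = 0; j < 4; j++) {
            acc += coef[i][j] * features[j];
        }
        out[i] = acc;
    }
}
```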
Here's the resource usage for this example:
Info: Device utilisation:
Info: ICESTORM_LC: 686/ 5280 12%
Info: ICESTORM_RAM: 2/ 30 6%
Info: SB_IO: 6/ 96 6%
Info: SB_GB: 8/ 8 100%
Info: ICESTORM_PLL: 0/ 1 0%
Info: SB_WARMBOOT: 0/ 1 0%
Info: ICESTORM_DSP: 3/ 8 37%
Info: ICESTORM_HFOSC: 1/ 1 100%
Info: ICESTORM_LFOSC: 0/ 1 0%
Info: SB_I2C: 0/ 2 0%
Info: SB_SPI: 0/ 2 0%
Info: IO_I3C: 0/ 2 0%
Info: SB_LEDDA_IP: 0/ 1 0%
Info: SB_RGBA_DRV: 0/ 1 0%
Info: ICESTORM_SPRAM: 0/ 4 0%
As expected, this design uses 3 DSP blocks, one for each output. However, you may be wondering why only 2 embedded RAM blocks are used, even though the function uses 5 arrays. If an array is very small, the synthesis tool may use flip flops to store the array data rather than an embedded RAM. Each coefficient array here is only 64 bits, and it would be somewhat wasteful to dedicate a 4096-bit RAM to it.
Next steps

This concludes the first half of the tutorial series, and I hope at this point you have a basic concept of how to create custom FPGA designs using C++. However, there is still much more to learn! Next up, we'll talk about another performance optimization topic: how to eliminate (or at least hide) I/O overhead.
Connect

Please follow me to stay up-to-date as I release new installments. There is also a Discord server (a public chat platform) for any comments, questions, or discussion you might have at https://discord.gg/3sA7FHayGH