In this week's installment, it's all about performance. One of the common uses of an FPGA is to increase performance in terms of speed and/or energy efficiency. This is achieved in part by 1) removing the interpretation overhead of instructions, 2) eliminating the central-memory bottleneck, and 3) exploiting instruction-level parallelism.
If you are new to this series, you may want to go back to Hardware-as-Code Part I.
Performance of example 1
In Part II, we generated a hardware implementation of the following simple function:
int16_t calc(int16_t x) {
return 7 * x - 15;
}
Let’s take a closer look at the performance of this function both as software executed on a CPU and as a custom hardware function on an FPGA. First, consider the energy used to execute this function on a CPU. A typical small CPU will consist of the following functional hardware blocks:
1. Instruction fetch
2. Instruction decode
3. Memory argument fetch
4. Execute instruction
5. Write back result
The first two of these are purely interpretation overhead required by the CPU model, and they are completely eliminated by the FPGA function. Numbers 3 and 5 are dedicated to the data movement necessitated by the central-memory model. FPGA functions often do not require external memory, so this is again overhead that is completely eliminated. Number 4 is the only part of the CPU that actually performs application-specific functionality. Yet all of these units consume energy continuously throughout program execution, and external memories also consume a considerable amount of power.
What about the execution time? Let's estimate the number of cycles this function might require to execute on a small CPU:
- Load x into a register (2 cycles)
- Load 1st constant into a register (1 cycle)
- Multiply (1 cycle)
- Load 2nd constant into a register (1 cycle)
- Add (1 cycle)
- Store result to memory (2 cycles)
That’s a total of 8 cycles! Of course, for a CPU with a larger instruction set and more complicated instructions you might be able to use fewer instructions, but then these usually take more cycles. So let’s say 4-8 cycles.
Now for the FPGA implementation, we have a single-cycle circuit that performs both the multiplication and the addition. It's like having a custom instruction built specifically for this application. Values are passed around via registers and there is no memory access. CPUs can sometimes take advantage of registers to pass values, but usually only a small number of them are available. Additionally, many functions will require stack memory to store local variables and temporary values.
Not convinced? Let’s expand this first example slightly to address a real world problem and see how that compares.
Machine learning classification example
Classification is a very common task in machine learning. The classification task is to categorize something into two or more classes based on some data you have about it. For example, based on vibration sensor data from a fan, classify it as working or not working (motor failure/stuck propeller).
Let's implement a simple classification of some data into two classes based on two measurements. The measurement data for a number of objects with known class are shown in the following graph.
Each point represents the two measurements of an example object, and the color represents the known class of that object. The goal is, given the two values (x, y) of a new object, to predict whether it is in the orange class or the blue class. As you can see from the graph, the orange class objects are all to the left of both the green and blue lines. The blue class objects are all to the right of one or both lines.
Let's implement a simple predictor function that just tests if a new point is to the left of both lines:
This code is also available from the git repo https://github.com/sathibault/hac-examples.git in the poly-classify folder.
Go ahead and build and test this classify function on both your computer and the FPGA board (if you need to review how that is done, go back to Part II). You should see output like the following:
poly-classify>.\program
classify(7, 82) = 1
classify(5, 100) = 1
classify(10, 70) = 0
classify(15, 100) = 0
The 1 output indicates that the point is to the left of both lines and the predicted class is orange. Otherwise, the predicted class is blue.
I picked this example because it is simple to explain and is very representative of the computation required by the extremely successful neural networks in use today.
Instruction-level parallelism
In addition to eliminating the central-memory bottleneck and interpretation overhead of the CPU, custom hardware also enables a high level of instruction-level parallelism. For this second example, the generated hardware for the classify function looks like the following:
As you can see, each equation has its own dedicated multiplier and adders. Although we've quadrupled the amount of work relative to the first example, this entire function still executes in a single cycle! Functionality that would normally correspond to many instructions on a CPU, executed sequentially, is instead executed in parallel. Try to estimate how many instructions/cycles this function would require on a CPU.
I’m really emphasizing the negative aspects of the CPU approach, but it’s not all roses for the FPGA. We can make some significant gains in speed and power consumption, but the downside is physical space. Each of the blocks in the diagram above takes up space on the FPGA. As a function grows, so does the amount of space it occupies, and there is only a limited amount available. Although space can also be an issue for the program memory of a microcontroller, the space constraints of an FPGA are generally more limiting. We will look at space usage in more detail in an upcoming installment.
Next steps
So far we have been looking at simple straight-line code examples. Next time, we will look at the use of loops and arrays.
Continue to Part IV: Embedded RAM
Connect
Please follow me to stay up-to-date as I release new installments. There is also a Discord server (public chat platform) for any comments, questions, or discussion you might have at https://discord.gg/3sA7FHayGH