Abstract
How a combinational task is described in HLS matters because it directly affects the performance of the whole system. Here, I explain why, using a simple example.
Introduction
A high-level synthesis (HLS) tool converts an algorithm into an equivalent RTL description. This description represents a logic circuit that can be implemented in ASIC or FPGA technology.
A logic circuit is one of two types: combinational or sequential. The outputs of a combinational circuit are a function only of the current logic values on its inputs. As shown in Figure 1, a combinational circuit can be implemented using only basic logic gates; no memory cells are required.
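As a small illustration of this definition, a combinational circuit corresponds to a pure function in C/C++: its result depends only on its arguments and it keeps no static or global state. The sketch below describes a 1-bit full adder; the names `AdderOut` and `full_adder` are hypothetical and are not taken from the figures.

```cpp
// Hypothetical example: a 1-bit full adder written as a pure function.
// No static or global variables are used, so the outputs depend only on
// the current inputs, which is what makes the circuit combinational.
struct AdderOut {
    bool sum;
    bool carry;
};

AdderOut full_adder(bool a, bool b, bool cin) {
    AdderOut out;
    out.sum = a ^ b ^ cin;                   // two XOR gates
    out.carry = (a & b) | (cin & (a ^ b));   // AND and OR gates
    return out;
}
```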
On the other hand, the outputs of a sequential circuit depend not only on the current values on its inputs but also on the history of past input values.
The effect of this history is usually modelled as the circuit state, which is stored in a set of memory cells.
Figure 2 shows the structure of a sequential circuit consisting of a combinational circuit and a set of memory cells that save the circuit states. The memory cells can be in the form of flip-flops, BRAMs or DDR memories.
The combinational part receives two groups of data: the primary inputs and the current state. It then generates two groups of outputs: the primary outputs and the next state. The primary outputs are used by other modules in the system, whereas the next-state data are written into the memory cells and define the new circuit state.
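The structure described above can be sketched in C++ as a combinational function plus a state variable that is updated once per clock cycle. This is only a software model under assumed names (`CombOut`, `counter_logic`, `clock_tick`), using a hypothetical 2-bit counter with an enable input as the example circuit.

```cpp
// Hypothetical example: a 2-bit counter with an enable input.
struct CombOut {
    bool wrap;            // primary output: set when the counter wraps to 0
    unsigned next_state;  // next state, to be written into the memory cells
};

// Combinational part: a function of the primary input and the current state.
CombOut counter_logic(bool enable, unsigned state) {
    CombOut out;
    out.wrap = enable && (state == 3);
    out.next_state = enable ? ((state + 1) & 3u) : state;
    return out;
}

// One clock cycle: evaluate the combinational part, then update the state.
bool clock_tick(bool enable, unsigned &state) {
    CombOut out = counter_logic(enable, state);
    state = out.next_state;  // models the memory cells capturing the next state
    return out.wrap;
}
```

Here `state` plays the role of the memory cells, and `counter_logic` is the part whose propagation delay would limit the clock period.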
Motivation
Every combinational circuit needs some time to produce stable outputs after its inputs change. This time is called the propagation delay. Different paths from inputs to outputs in a combinational circuit may have different delays. The longest path, also called the critical path, defines the propagation delay of the design.
In sequential circuits, the clock period has a direct impact on design performance. The propagation delay of the combinational part in Figure 2 determines the minimum clock period. Therefore, it has a direct impact on the whole system performance.
A sequential circuit usually needs a few clock cycles to finish its associated task. The maximum number of required clock cycles is called design latency. The combinational part also has a direct impact on the latency of the associated sequential circuit.
Therefore, knowing how to design an efficient combinational circuit in HLS is the first step towards developing a high-performance algorithm on hardware.
Impact of combinational circuits
Here, using an example, I will explain how a well-chosen C/C++ description of a combinational design can lead to a faster implementation.
Let’s assume we want to show the four decimal digits of a number on the four seven-segment displays available on a board such as the Basys3 FPGA evaluation board shown in Figure 3.
The first step is to extract the four decimal digits; the next steps are to find the 7-segment code corresponding to each digit and to send the codes to the displays on the board. Here, I will only explain the first step, extracting the four decimal digits.
Let’s consider the following Vivado HLS code, which extracts the decimal digits of a 4-digit unsigned integer. The design top function is extract_decimal_digits, which accepts one input argument (a) and generates four outputs (digit_1, digit_2, digit_3 and digit_4). It calls the function get_digit four times, once for each digit. The get_digit function extracts the least-significant digit of the number it receives and then modifies that number.
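The original listing is not reproduced in this text. A minimal sketch consistent with the description might look like the code below. The function and argument names come from the description; the argument types, the pass-by-reference style and the assumption that digit_1 is the least-significant digit are guesses for illustration only.

```cpp
// Sketch only: types and digit ordering are assumptions, not the original code.

// Returns the least-significant decimal digit of a, then removes it from a.
unsigned char get_digit(unsigned int &a) {
    unsigned char digit = static_cast<unsigned char>(a % 10);  // modulus operator
    a = a / 10;
    return digit;
}

// Top function: splits a 4-digit number into its decimal digits.
void extract_decimal_digits(unsigned int a,
                            unsigned char &digit_1,
                            unsigned char &digit_2,
                            unsigned char &digit_3,
                            unsigned char &digit_4) {
    digit_1 = get_digit(a);  // assumed: units
    digit_2 = get_digit(a);  // assumed: tens
    digit_3 = get_digit(a);  // assumed: hundreds
    digit_4 = get_digit(a);  // assumed: thousands
}
```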
Figure 5 shows the report obtained when high-level synthesis generates the equivalent RTL design for the Basys3 board. The Vivado HLS synthesis process uses the parameterised Xilinx LogiCORE divider core to implement the modulus operator. The resulting design is pipelined.
As can be seen, the design clock period constraint is 10 ns (annotated by 1). The pipelined design requires 35 cycles to finish its task, that is, 0.35 µs (denoted by 2). In addition, it uses 12 DSPs, 1474 FFs and 1057 LUTs.
Now let’s consider the following implementation, in which I have replaced the modulus operator with its equivalent arithmetic expression, that is, a%10 = a - 10*(a/10). If we use this expression directly, the compiler optimises the code, uses the modulus operator again, and generates the same RTL description. To stop the compiler from optimising the code, I have used a separate sub-function to perform the division-by-10 operation. In addition, I have turned off function inlining in the compiler.
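Again, the original listing is not reproduced in this text. The sketch below shows one way the described changes could be written. The helper name `div_by_10`, the argument types and the placement of the `#pragma HLS INLINE off` directive are assumptions; a regular C++ compiler ignores the pragma, so the code can still be tested in software.

```cpp
// Sketch only: helper name, types and pragma placement are assumptions.

// Division by 10 kept in its own function so that the compiler does not
// fold a - 10*(a/10) back into a modulus operation.
unsigned int div_by_10(unsigned int a) {
#pragma HLS INLINE off
    return a / 10;
}

// Returns the least-significant decimal digit of a, then removes it from a.
unsigned char get_digit(unsigned int &a) {
    unsigned int q = div_by_10(a);
    unsigned char digit = static_cast<unsigned char>(a - 10 * q);  // a % 10
    a = q;
    return digit;
}

// Top function: same interface as in the first implementation.
void extract_decimal_digits(unsigned int a,
                            unsigned char &digit_1,
                            unsigned char &digit_2,
                            unsigned char &digit_3,
                            unsigned char &digit_4) {
    digit_1 = get_digit(a);
    digit_2 = get_digit(a);
    digit_3 = get_digit(a);
    digit_4 = get_digit(a);
}
```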
Now, if we synthesise this code, Figure 7 shows the corresponding report.
The circuit is fully combinational. The circuit propagation delay is 23.607 ns, and it utilises 28 DSPs and 262 LUTs.
It is a good idea to compare the two implementations. Figure 8 shows this comparison. In this figure, “Solution 1” corresponds to the first implementation that uses the modulus operator, and “Solution 2” represents the second implementation.
As can be seen, the first implementation requires 35 clock cycles, and as the clock period is 10 ns, it takes 350 ns to generate the output. The second implementation, however, needs only 23.607 ns to generate the output, so it is about 14.83 times faster.
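The arithmetic behind this comparison can be written as two small helper functions. The names `execution_time_ns` and `speedup` are hypothetical and used only for illustration; the numbers come from the synthesis reports quoted in the text.

```cpp
// Execution time of a clocked design: number of cycles times clock period.
double execution_time_ns(unsigned cycles, double clock_period_ns) {
    return cycles * clock_period_ns;
}

// Speed-up of a faster design relative to a slower one.
double speedup(double slow_time_ns, double fast_time_ns) {
    return slow_time_ns / fast_time_ns;
}
```

With the reported figures, `execution_time_ns(35, 10.0)` gives 350 ns, and `speedup(350.0, 23.607)` is roughly 14.83.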
In addition, the second implementation uses far fewer LUTs (262 versus 1057), although it needs more DSPs (28 versus 12).
Designing an efficient combinational circuit is the first step in developing an algorithm or a system controller in HLS. Several optimisation techniques and coding styles are available to describe the combinational part of a complex algorithm. If you are interested in learning them, you can refer to “Digital System Design with High-Level Synthesis for FPGA: Combinational Circuits”.