Last week, we looked at how we could work with loops in our HLS source code, exploring how we can flatten, merge and unroll loops.
This week, we are going to look at how we can work with the HLS analysis perspective, so we can understand where our optimizations will have the best impact in our HLS source.
The analysis perspective is populated with information once we have run HLS synthesis. It presents information regarding the hierarchy, resource utilization and, most importantly, the latency of our design.
For our first example, let’s take a look at the loop merge example from last week. With the loops left separate, the synthesis view reports a latency of 353 clocks before the design can process a new input. However, if we want to optimize the performance, we need to know a little more about where the bottlenecks are.
This information is provided by the analysis perspective.
Within the analysis perspective, we will find several windows that enable us to further optimize our design.
Taking the loop merge example from last week under the performance profile, you will notice three loops contained within the accum function. For each of the loops, the total latency is provided, which is the iteration latency multiplied by the trip count.
When it comes to optimizing the performance, we first want to focus upon the iteration latency. This is the number of clock cycles taken to complete one iteration of the loop.
Once we have optimized the latency of each loop iteration, we can then work on optimizing the number of times the loop executes, i.e. the trip count.
In the case of the loop merge example, the first step to increase performance is to decrease the number of clock cycles required for each iteration of the loop.
To understand more about which operations in a loop's iteration increase the latency, we can use the schedule viewer.
The schedule viewer allows us to see not only the operations but also cross probe to the source to understand which line of code causes this operation.
In the schedule viewer for this example, you can see that the load operation, caused by the read from the block RAM holding the values of variable A, increases the latency. The read takes more than one clock cycle, resulting in an increased iteration latency.
We can use this information to fracture the BRAM, which contains the variable A. This is achieved by using an array partition pragma:
#pragma HLS ARRAY_PARTITION variable=a complete dim=1
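To show where the pragma sits, here is a minimal sketch. The accum source from last week is not reproduced in this post, so the array size of 50, the arguments b and c, and the loop bodies are assumptions made purely for illustration; only the pragma line itself comes from the text above. A standard C++ compiler ignores the HLS pragma, so the sketch can also be exercised from a plain software test bench.

```cpp
// Hypothetical stand-in for the accum function: the size and loop bodies
// are assumed for illustration and are not taken from the original source.
const int N = 50;

int accum(const int a[N], const int b[N], const int c[N]) {
// Completely partition array a along its only dimension, so each element
// becomes an individual register instead of sharing one block RAM.
#pragma HLS ARRAY_PARTITION variable=a complete dim=1
    int sum_a = 0;
    int sum_b = 0;
    int sum_c = 0;

loop_1:
    for (int i = 0; i < N; i++) {
        sum_a += a[i];  // read of a: no longer limited by a BRAM port
    }
loop_2:
    for (int i = 0; i < N; i++) {
        sum_b += b[i];
    }
loop_3:
    for (int i = 0; i < N; i++) {
        sum_c += c[i];
    }
    return sum_a + sum_b + sum_c;
}
```

The pragma is placed inside the function body, where a is in scope. The loop labels are there so the loops appear under readable names in the analysis perspective.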
Re-synthesizing the design now shows a reduced overall latency, as the latency of loop_1 has been reduced by one third to 100 clocks: each iteration of loop_1 now takes two clocks compared to the previous three. However, the overall latency is still high at 303 clocks.
Exploring the schedule viewer in the analysis perspective this time, expanding loop_1 shows 50 entries for variable A, as it has been completely fractured. It also confirms that the iteration latency of the loop is now two clocks.
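These figures can be checked with the relationship given earlier: total loop latency is the iteration latency multiplied by the trip count. The short sketch below works through it. The helper name loop_latency is hypothetical, and the trip count of 50 is inferred from the 100-clock total at two clocks per iteration rather than stated directly.

```cpp
// Total loop latency as shown in the performance profile:
// iteration latency multiplied by trip count.
constexpr int loop_latency(int iteration_latency, int trip_count) {
    return iteration_latency * trip_count;
}

// loop_1 figures from the text; the trip count is inferred (100 / 2).
constexpr int kTripCount = 50;
constexpr int kBefore = loop_latency(3, kTripCount);  // three clocks per iteration
constexpr int kAfter  = loop_latency(2, kTripCount);  // two clocks per iteration

static_assert(kBefore == 150, "loop_1 before partitioning");
static_assert(kAfter == 100, "loop_1 after partitioning");

// Saving one clock on each of 50 iterations removes 50 clocks
// from the overall latency: 353 becomes 303.
static_assert(353 - (kBefore - kAfter) == 303, "overall latency after partitioning");
```

Because the checks are static_assert statements, simply compiling the file confirms that the reported numbers are consistent with one another.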
If we apply the array partition pragma to variable A and merge the loops as we did last week, we achieve better performance than when the loops are merged alone, reducing the initiation interval from the previous 102 clocks to 52 clocks. Overall, this is a very significant reduction from the original 354 clock initiation interval.
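As a rough sketch of combining the two optimizations, the code below applies both pragmas in one function. It assumes the merge is requested with the LOOP_MERGE pragma, and the function name accum_merged, the array size and the loop bodies are hypothetical, since the original source is not shown here.

```cpp
// Hypothetical example: the size, names and loop bodies are assumptions.
const int N = 50;

int accum_merged(const int a[N], const int b[N], const int c[N]) {
// Ask the tool to merge the consecutive loops in this function body.
#pragma HLS LOOP_MERGE
// Completely partition a so its reads do not stall the merged loop.
#pragma HLS ARRAY_PARTITION variable=a complete dim=1
    int sum_a = 0;
    int sum_b = 0;
    int sum_c = 0;

loop_1:
    for (int i = 0; i < N; i++) {
        sum_a += a[i];
    }
loop_2:
    for (int i = 0; i < N; i++) {
        sum_b += b[i];
    }
loop_3:
    for (int i = 0; i < N; i++) {
        sum_c += c[i];
    }
    return sum_a + sum_b + sum_c;
}
```

Neither pragma changes the result of the function, only how it is scheduled, so the same C test bench can be reused before and after the optimization.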
Of course, in optimizing for performance, we have traded increased logic resources for speed. In this simple case, however, we have been able to increase the performance without consuming significant logic resources, so the trade-off is very acceptable.
Hopefully now you understand a little more about how we can use the analysis perspective to optimize our HLS design. It really is very powerful and easy to use!
See My FPGA / SoC Projects: Adam Taylor on Hackster.io
Get the Code: ATaylorCEngFIET (Adam Taylor)
Access the MicroZed Chronicles Archives, with over 300 articles on the FPGA / Zynq / Zynq MPSoC, updated weekly at MicroZed Chronicles.