In the previous tutorial, we saw that the AI Engine simulation is cycle approximate, so it gives a close estimation of what we will get on the actual hardware. I thought it would be interesting to measure the latency from the traces generated by the AI Engine simulator.
Note: This tutorial was created using AMD Vitis 2025.1. Tool flow may vary in other versions of the tool.

Measuring the latency of an AI Engine graph
In the previous tutorial, after running the AIE Simulator we got the following output in the file generated by the simulation (output.txt):
T 1510400 ps
0 2
T 1516800 ps
4 6
T 1523200 ps
8 10
The lines starting with a T give the timestamp of the sample on the following line.
The first one, 1510400 ps, means that the first sample is present at the output of the AI Engine array after 1.5104 us. This is approximately the initial latency of the AI Engine array (from first sample in to first sample out).
If we check the timestamps of two consecutive lines, we can see that each new sample arrives 6.4 ns after the previous one, which corresponds to a rate of 156.25 MHz.
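The two measurements above (initial latency and sample rate) can be extracted automatically from output.txt. Below is a minimal sketch, assuming only the file format shown above ("T &lt;value&gt; ps" lines followed by sample lines); the helper name is hypothetical, not part of the AMD tools.

```python
def parse_timestamps(lines):
    """Return the timestamps (in picoseconds) from 'T <value> ps' lines."""
    return [int(line.split()[1]) for line in lines if line.startswith("T ")]

# Sample data taken from the output.txt excerpt above.
sample = """T 1510400 ps
0 2
T 1516800 ps
4 6
T 1523200 ps
8 10""".splitlines()

ts = parse_timestamps(sample)
initial_latency_us = ts[0] / 1e6   # first sample out: ~1.5104 us
period_ps = ts[1] - ts[0]          # 6400 ps = 6.4 ns between samples
rate_mhz = 1e6 / period_ps         # 156.25 MHz sample rate
print(initial_latency_us, period_ps, rate_mhz)
```

Running the same parser on the full output.txt would confirm that every consecutive pair of timestamps is 6.4 ns apart.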
We can find a similar number in the platform properties (open the AI Engine component settings file vitis-comp.json and click on Platform Information).
We can see that this frequency matches the frequency of the PL clock we have in the platform.
Then, looking at lines 63-65 of the output file generated by the simulation (output.txt), we can see the following:
T 1708800 ps
TLAST
30 28
The TLAST marker means the following sample is the last sample of the graph iteration. It means that our graph (with the 2 kernels) took 1.7088 us to complete the first graph iteration. This time includes filling the input buffer, running the 2 kernels, and writing out the output buffer.
Then looking at lines 128-130 of the output.txt file:
T 2662400 ps
TLAST
30 28
This shows the timestamp of the last sample of the second graph iteration. It means that the second iteration completes 0.9536 us after the first one, which is much faster than the first iteration. The reason for that is the ping/pong buffers at the input of the graph, as we have seen in the previous tutorial.
While the first iteration of the graph has to wait for the ping buffer to be filled with the input data, this is not the case for the second iteration because the pong buffer was filling up while the kernels were working on the ping buffer.
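The per-iteration times can also be computed by locating the TLAST markers in output.txt. Here is a small sketch, assuming the format shown above (the timestamp line precedes the TLAST marker); the parser is a hypothetical helper, not an AMD-provided tool.

```python
def tlast_times_ps(lines):
    """Return the timestamp (ps) associated with each TLAST marker."""
    times, last_t = [], None
    for line in lines:
        if line.startswith("T "):
            last_t = int(line.split()[1])
        elif line.strip() == "TLAST":
            times.append(last_t)
    return times

# Sample data taken from the output.txt excerpts above (iterations 1 and 2).
sample = ["T 1708800 ps", "TLAST", "30 28",
          "T 2662400 ps", "TLAST", "30 28"]

t = tlast_times_ps(sample)
first_iter_us = t[0] / 1e6            # ~1.7088 us, includes filling the ping buffer
second_iter_us = (t[1] - t[0]) / 1e6  # ~0.9536 us, faster thanks to ping/pong buffering
print(first_iter_us, second_iter_us)
```

Applied to the full output.txt, this would give the completion time of each of the 4 graph iterations, making the first-iteration overhead easy to spot.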
To get a better view of the latency introduced by the different stages, a good way is to look at the traces that can be generated from the simulation.
Traces are not enabled by default. We need to enable them in the simulation options by setting the EnableTraces option in the simulation settings file (launch.json).
If we run the simulation again we will be able to see the traces under REPORTS > Trace.
This is the view we are getting from the trace report
We can see various information organized by AI Engine tile, such as the function running, the locks, or the DMAs.
The light green and light blue boxes indicate the kernel executions. As we have seen in the previous articles, the graph (and thus the kernels) runs 4 times, and the kernels, which are mapped to the same AI Engine tile, execute one after the other.
If we place the cursor at the end of the first run of the second kernel, we can see it indicates 1.460 us, which is quite close to the number we got from the first timestamp in the simulation output file for the initial latency.
Then we can add cursors at the beginning of the first and second executions of the first kernel.
The time difference between the two cursors (1475 ns - 524 ns) is 951 ns. This is also close to the value we got from the timestamps in the output text file when subtracting the time of the last sample of the first iteration from the time of the second graph execution.
This is basically the time our graph needs to complete an iteration (including the execution of the 2 kernels) when the data is pipelined (the graph is not waiting for data to run the processing).
Summary

In this article, we have seen how to measure the latency from the AI Engine simulation output text file and how to enable and analyze the traces from the simulation to get a more granular measurement of the latency of the graph.
In the next article we will see how we can improve the latency of our graph.
Disclaimers

- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.