In the previous tutorial, we saw that the AI Engine simulation is cycle approximate, so it gives a close estimation of what we will get on the actual hardware. I thought it would be interesting to measure the latency from the traces generated by the AI Engine simulator.
Note: This tutorial was created using AMD Vitis 2025.1. Tool flow may vary in other versions of the tool.

Measuring the latency of an AI Engine graph
In the previous tutorial, after running the AIE Simulator we got the following output in the file generated by the simulation (output.txt):
T 1510400 ps
0 2
T 1516800 ps
4 6
T 1523200 ps
8 10
The lines starting with a T give the timestamp of the sample on the following line.
The first one, 1510400 ps, means that the first sample is present at the output of the AI Engine array after 1.5104 us. This is approximately the initial latency of the AI Engine array (from first sample in to first sample out).
If we check the timestamps of two consecutive lines, we can see that each new sample arrives 6.4 ns after the previous one, which corresponds to a rate of 156.25 MHz.
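The two measurements above (initial latency and sample rate) can be extracted automatically from output.txt. Below is a minimal sketch, assuming only the file format shown above ("T &lt;value&gt; ps" lines followed by sample lines); the helper name is hypothetical, not part of the AMD tools.

```python
def parse_timestamps(lines):
    """Return the timestamps (in picoseconds) from 'T <value> ps' lines."""
    return [int(line.split()[1]) for line in lines if line.startswith("T ")]

# Sample data taken from the output.txt excerpt above.
sample = """T 1510400 ps
0 2
T 1516800 ps
4 6
T 1523200 ps
8 10""".splitlines()

ts = parse_timestamps(sample)
initial_latency_us = ts[0] / 1e6   # first sample out: ~1.5104 us
period_ps = ts[1] - ts[0]          # 6400 ps = 6.4 ns between samples
rate_mhz = 1e6 / period_ps         # 156.25 MHz sample rate
print(initial_latency_us, period_ps, rate_mhz)
```

Running the same parser on the full output.txt would confirm that every consecutive pair of timestamps is 6.4 ns apart.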
We can find a similar number in the platform properties (open the AI Engine component settings file vitis-comp.json and click on Platform Information).
We can see that this frequency matches the frequency of the PL clock we have in the platform.
Then, looking at lines 63-65 of the output file generated by the simulation (output.txt), we can see the following:
T 1708800 ps
TLAST
30 28
The TLAST marker means the following sample is the last sample of the graph iteration. It means that our graph (with the 2 kernels) took 1.7088 us to complete the first graph iteration. This time includes filling the input buffer, running the 2 kernels, and writing out the output buffer.
Then looking at lines 128-130 of the output.txt file:
T 2662400 ps
TLAST
30 28
This shows the timestamp of the last sample of the second graph iteration. It means that the second iteration completes 0.9536 us after the first one, which is much faster than the first iteration. The reason for that is the ping/pong buffers at the input of the graph, as we have seen in the previous tutorial.
While the first iteration of the graph has to wait for the ping buffer to be filled with the input data, this is not the case for the second iteration because the pong buffer was filling up while the kernels were working on the ping buffer.
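The per-iteration times can also be computed by locating the TLAST markers in output.txt. Here is a small sketch, assuming the format shown above (the timestamp line precedes the TLAST marker); the parser is a hypothetical helper, not an AMD-provided tool.

```python
def tlast_times_ps(lines):
    """Return the timestamp (ps) associated with each TLAST marker."""
    times, last_t = [], None
    for line in lines:
        if line.startswith("T "):
            last_t = int(line.split()[1])
        elif line.strip() == "TLAST":
            times.append(last_t)
    return times

# Sample data taken from the output.txt excerpts above (iterations 1 and 2).
sample = ["T 1708800 ps", "TLAST", "30 28",
          "T 2662400 ps", "TLAST", "30 28"]

t = tlast_times_ps(sample)
first_iter_us = t[0] / 1e6            # ~1.7088 us, includes filling the ping buffer
second_iter_us = (t[1] - t[0]) / 1e6  # ~0.9536 us, faster thanks to ping/pong buffering
print(first_iter_us, second_iter_us)
```

Applied to the full output.txt, this would give the completion time of each of the 4 graph iterations, making the first-iteration overhead easy to spot.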
To get a better view of the latency introduced by the different stages, a good way is to look at the traces that can be generated from the simulation.
Traces are not enabled by default. We need to enable them in the simulation options by setting the EnableTraces option in the simulation settings file (launch.json).
If we run the simulation again we will be able to see the traces under REPORTS > Trace.
This is the view we are getting from the trace report
We can see various information organized by AI Engine tile, such as the function running, the locks, or the DMAs.
The light green and light blue boxes indicate the kernel executions. As we have seen in the previous articles, the graph (and thus the kernels) runs 4 times, and the kernels, which are mapped to the same AI Engine tile, execute one after the other.
If we place the cursor at the end of the first run of the second kernel, we can see it indicates 1.460 us, which is quite close to the number we got from the first timestamp in the simulation output file for the initial latency.
Then we can add cursors at the beginning of the first and second executions of the first kernel.
The time difference between the two cursors (1475 ns - 524 ns) is 951 ns. This is also close to the value we got from the timestamps in the output text file when subtracting the time of the last sample of the first iteration from the time of the second graph execution.
This is basically the time our graph needs to complete an iteration (including the execution of the 2 kernels) when the data is pipelined (the graph is not waiting for data to run the processing).
Summary

In this article, we have seen how to measure the latency from the AI Engine simulation output text file and how to enable and analyze the traces from the simulation to get a more granular measurement of the latency of the graph.
In the next article we will see how we can improve the latency of our graph.
Disclaimers

- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.