While the AI Engine is software programmable, to get the best latency and throughput out of it, it is important to understand what is happening on the actual hardware. If you are an FPGA designer, you will find many parallels with coding for an FPGA.
Let's break down the different parts of the graph to understand, step by step, what is happening on the hardware and how we can improve it.
In this tutorial, we are still using the same AI Engine application from the acceleration system introduced in this tutorial.
We walked through the code in this tutorial, ran it in simulation in this tutorial, and estimated the latency of the graph in this tutorial.
Note: This tutorial was created using 2025.1. Tool flow may vary in other versions of the tool.
Note 2: You can use the following project to rebuild an AMD Vitis workspace to follow the steps in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Input PLIO to input memory
Run make all to build the workspace before any graph optimization.
As we have seen in this tutorial, the kernel first processes data from an input buffer. When working with buffers on the AI Engine, a kernel waits for its input buffer to be filled with the amount of data set in the graph before starting.
When running the AI Engine simulation with Traces enabled, we can see that the first iteration of the kernel first starts after 523 ns. This means that the ping buffer is filled with its 32 samples after about 523 ns.
The latency before the input buffer is full will depend on 2 factors:
- The speed of the PL feeding the input PLIO
- The size of the input buffer. The less data needed to fill the buffer, the faster your kernel will start. However, this only reduces the initial latency and will increase the overall latency of the graph: if you are working with smaller sets of data, you will switch context more frequently, so you add latency to acquire the buffer locks, and if your kernel is pipelined you get more kernel calls for which the pipeline has to be filled. So in general, to improve the throughput of the graph, you want to increase the size of the buffers (see the sketch after this list for where this size is set).
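To make this concrete, here is a minimal sketch of how the input side of such a graph can be declared, assuming a buffer-based kernel named first working on cint16 data; the class name, source file name and runtime ratio are placeholders, not taken from the tutorial project:

#include <adf.h>
using namespace adf;

// Buffer-based kernel: it is only scheduled once its input buffer is full
void first(input_buffer<cint16>& in, output_buffer<cint16>& out);

class input_side_graph : public graph {
public:
    kernel k_first;
    input_plio in;

    input_side_graph() {
        k_first = kernel::create(first);
        source(k_first) = "src/first.cc";   // hypothetical source file name
        runtime<ratio>(k_first) = 0.5;      // placeholder value

        in = input_plio::create(plio_32_bits, "data/input.txt");
        connect(in.out[0], k_first.in[0]);

        // 32 cint16 samples per buffer: the smaller this number, the sooner the
        // first kernel call can start, at the cost of more buffer switches later
        dimensions(k_first.in[0])  = {32};
        dimensions(k_first.out[0]) = {32};
    }
};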
In the previous article, we saw that the timestamps between 2 output samples were spaced by 6.4 ns, which corresponds to a frequency of 156.25 MHz, the default PL frequency of our platform.
This means that the AI Engine simulator assumes that the PL feeding data to and receiving data from the AI Engine is running at 156.25 MHz.
However, looking at the datasheet for the AMD Versal AI Edge devices (DS958), we can see that the AI Engine to PL interface can run at frequencies up to 500 MHz for -1L speed grade devices (such as the one on the TE0950 board).
We can tell the AI Engine compiler that we intend to run the PLIO at 500 MHz by adding the frequency as a parameter when declaring the input PLIO in project.h:
in = input_plio::create(plio_32_bits, "data/input.txt", 500);
After making the change and running the AI Engine simulator with Traces enabled, we can see that the first call of the kernel first now happens after 368 ns.
There is a way we can improve this latency even further.
The samples now arrive at the boundary of the AIE-ML array at 500 MHz, while the AI Engine itself, on a -1L speed grade device, runs at 1 GHz.
Currently, we are sending the PL data in 32-bit words at 500 MHz, while inside the AIE-ML array the streams are 32 bits wide at 1 GHz. This means that through the clock domain crossing we are only using half of the bandwidth the AIE-ML stream can support (32 bits x 500 MHz = 16 Gb/s out of the 32 bits x 1 GHz = 32 Gb/s available).
To address this, we can use the fact that the native PL interfaces of the AIE-ML array are 64 bits wide. This way you get the same bandwidth from the PL to the AI Engine interface (64 bits at 500 MHz) as from the AI Engine interface to the AIE-ML array (32 bits at 1 GHz).
The size of the PLIO interface is also defined in the PLIO declaration in graph.h. We can simply change it to plio_64_bits:
in = input_plio::create(plio_64_bits, "data/input_64.txt", 500);
We also have to change the simulation input file so that it has 2 samples per line (i.e. 4 numbers per line, 2 for each complex integer 16 sample). I have attached the file input_64.txt to this page.
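For illustration, with a 64-bit PLIO each line of the stimulus file carries two cint16 samples as real/imaginary pairs; with a simple ramp as purely hypothetical data the first lines would look like this (the attached input_64.txt contains the actual stimulus):

0 1 2 3
4 5 6 7
8 9 10 11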
After making the change and running the AI Engine simulator with Traces enabled, we can see that the first call of the kernel first now happens after 337 ns.
This is very close to the best "initial" latency we can get without reducing the size of the input buffer (or using streams).
Note: This is not only the latency needed to fill the buffer; it also includes some latency to initialize the array (as shown by the _main_init line in grey on the traces). Thus, if the AIE-ML array were already initialized and waiting for data, the latency from the first sample in to the kernel starting would be much lower.
Then, once the kernel first is running, reading data from the ping buffer and writing to its output buffer, the input pong buffer fills in parallel to be ready for the next kernel call. There is no further improvement we can make on the input side at the graph level.
We can still reduce the time taken by the kernel to process the data, but this is something we will see in the next tutorial.
Kernel parallelization
We have seen in this previous tutorial that the kernels first and second are running on the same tile, one after the other, and working on the same buffer, i.e. there is no double buffer between the kernel first and the kernel second.
This is a great resource saving (and power saving) if you do not need to process the data faster. In this scenario, the kernel second waits for the kernel first to complete its processing before it has input data. This would be the case for the first iteration even if the kernels were not sharing a tile, but here, while the kernel second is running, the kernel first cannot run even though data is already available in the pong input buffer.
In this scenario we can see that the first graph iteration (the end of the first kernel second call) finishes after about 1,270 ns.
And the fourth iteration (the last of our simulation) finishes after 4,123 ns.
If you want a higher throughput, you might want to have the kernels running on different AIE-ML tiles. We have seen that the runtime<ratio> attribute determines whether kernels get merged into the same tile. So if we want the maximum throughput, we can state that each of our kernels needs all the processing time available in a tile by setting its runtime<ratio> to 1 (the maximum value):
runtime<ratio>(first) = 1;
runtime<ratio>(second) = 1;
We can see in the array report that the 2 kernels are now implemented on 2 different tiles.
And from the graph report we can see that there is a double buffer implemented between the 2 kernels, which means that the kernels will be able to run in parallel.
After running the AI Engine simulator with Traces enabled we can now see the 2 kernels running in parallel. We can see that the second call of kernel first is happening at the same time as the first call of kernel second.
When looking at the end of the first call of the kernel second, it happens after 1,290 ns, so slightly later than when the kernels were sharing the same tile. This can be explained by the fact that each kernel now has to manage the locks for the double buffer they use to communicate.
A similar time is expected for the first execution of the graph because, even though the kernels now run on different tiles and no longer share the processing time of a single tile, the kernel second still needs to wait for the kernel first to complete its first execution before the input data is available through the ping buffer.
But if we look at the fourth iteration of the kernel second, it now finishes after 2,751 ns, a big improvement in latency and throughput.
This number can easily be explained. The end of the 4th kernel second call happens 1,372 ns earlier (4,123 ns - 2,751 ns). Three iterations of the kernels now run in parallel instead of one after the other, and each kernel iteration needs about 460 ns to complete, which accounts for the roughly 3 x 460 ns saved.
Output memory to output PLIO
The latency on the output PLIO is a similar story to what we saw on the input side. When checking the output text file (output.txt) we can see the following first lines:
T 1344 ns
0 2
T 1350400 ps
4 6
T 1356800 ps
8 10
T 1363200 ps
12 14
We can see that the first sample arrives after 1,344 ns. We can already see the improvement in the latency of our system: if you remember from this previous tutorial, before the changes the first output sample was arriving after 1,510.4 ns.
We can also see that consecutive samples are spaced by 6.4 ns, reflecting the same 156.25 MHz default frequency we had on the input.
We can apply the same settings in project.h to the output PLIO as we did for the input PLIO: change the frequency to 500 MHz and the data width to 64 bits.
out = output_plio::create(plio_64_bits, "data/output.txt", 500);
After building the graph and running it through the AI Engine simulator, we can see the following first lines in the output text file (output.txt):
T 1324 ns
0 2 4 6
T 1326 ns
8 10 12 14
T 1328 ns
16 18 20 22
T 1330 ns
24 26 28 30
We can see that we have slightly improved the latency of the first sample, but most importantly the consecutive lines are now only 2 ns apart and each carries 2 samples, greatly improving the throughput of the system (2 samples every 2 ns is 1 GSa/s, versus 1 sample every 6.4 ns, i.e. 156.25 MSa/s, before the changes).
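Putting all the graph-level changes from this tutorial together, the graph could look like the following sketch. This is only a minimal illustration, assuming buffer-based cint16 kernels named first and second with 32-sample buffers; the class name and source file names are placeholders, not the actual code from the tutorial project:

#include <adf.h>
using namespace adf;

// Kernel prototypes; the kernel bodies live in their own source files
void first(input_buffer<cint16>& in, output_buffer<cint16>& out);
void second(input_buffer<cint16>& in, output_buffer<cint16>& out);

class simple_graph : public graph {
public:
    kernel k_first, k_second;
    input_plio  in;
    output_plio out;

    simple_graph() {
        k_first  = kernel::create(first);
        k_second = kernel::create(second);
        source(k_first)  = "src/first.cc";    // hypothetical file names
        source(k_second) = "src/second.cc";

        // Force each kernel onto its own tile so they can run in parallel
        runtime<ratio>(k_first)  = 1;
        runtime<ratio>(k_second) = 1;

        // 64-bit PLIOs running at 500 MHz on the PL side
        in  = input_plio::create(plio_64_bits, "data/input_64.txt", 500);
        out = output_plio::create(plio_64_bits, "data/output.txt", 500);

        // 32-sample buffers on every port (assumed size)
        dimensions(k_first.in[0])   = {32};
        dimensions(k_first.out[0])  = {32};
        dimensions(k_second.in[0])  = {32};
        dimensions(k_second.out[0]) = {32};

        connect(in.out[0],       k_first.in[0]);
        connect(k_first.out[0],  k_second.in[0]);  // double buffer inserted here by the compiler
        connect(k_second.out[0], out.in[0]);
    }
};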
Note 2: You can use the following project to rebuild an AMD Vitis workspace to get the final version of the project after the steps mentioned in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Summary
Run make all VERSION=2 to build this final version.
In this tutorial, we have seen how to modify the graph file to achieve lower latency and higher throughput in our AI Engine application. There are still improvements we can make to the kernel code, by vectorizing it, to further improve the throughput and latency of the system. This is what we will see in the next tutorial.
Disclaimers
- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.




