In the previous tutorial, we saw how to improve the performance of an AMD AI Engine application at the graph level. However, the application is not yet fully optimized: as we will see, the kernel does not use the Very Long Instruction Word (VLIW) or Single Instruction Multiple Data (SIMD) capabilities of the AIE-ML.
In this tutorial, we will analyze the current kernel code to understand how we can improve it.
Note: This tutorial was created using AMD Vitis 2025.1. Tool flow may vary in other versions of the tool.
Note 2: You can use the following project to rebuild an AMD Vitis workspace to follow the steps in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Kernel code performance analysis
Run make all VERSION=2 to build the workspace before any graph optimization.
In the previous tutorial, we saw that our kernel takes 461 ns to complete its execution, which corresponds to 461 clock cycles because the AIE-ML runs at 1 GHz on the -1L speed grade.
If we zoom in on the output of kernel first in the traces, we can also see that the output samples are spaced 14 ns apart, that is, 14 AI Engine clock cycles.
If you remember the kernel code from the previous tutorial, the kernel processes the samples one by one in a for loop:
void simple(adf::input_buffer<cint16> & in, adf::output_buffer<cint16> & out) {
    [...]
    for (unsigned i=0; i<NUM_SAMPLES; i++) {
        [...]
    }
}
The number of samples processed by each iteration of the kernel is defined in include.h and is set to 32:
#define NUM_SAMPLES 32
If each sample takes 14 ns to produce, the loop alone takes at least 32 × 14 = 448 ns. Adding some overhead brings us to the 461 clock cycles observed.
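The arithmetic behind this estimate can be written as a small host-side C++ check. This is only a sketch: the helper names (nsToCycles, loopCycles, overheadCycles) are hypothetical and are not part of the tutorial project, while the numbers are the ones quoted above (32 samples, 14 cycles per sample, 461 ns measured, 1 GHz clock).

```cpp
#include <cstdint>

// Figures quoted in the text above.
constexpr std::uint64_t kNumSamples      = 32;   // NUM_SAMPLES in include.h
constexpr std::uint64_t kCyclesPerSample = 14;   // sample spacing seen on the trace
constexpr std::uint64_t kMeasuredNs      = 461;  // kernel duration seen on the trace
constexpr std::uint64_t kClockMHz        = 1000; // 1 GHz AI Engine clock

// Hypothetical helper: convert a duration in ns into clock cycles.
constexpr std::uint64_t nsToCycles(std::uint64_t ns, std::uint64_t clockMHz) {
    return ns * clockMHz / 1000;
}

// Hypothetical helper: cycles spent producing the samples themselves.
constexpr std::uint64_t loopCycles(std::uint64_t samples, std::uint64_t cyclesPerSample) {
    return samples * cyclesPerSample;
}

// Cycles of the measured run that the per-sample work does not explain.
constexpr std::uint64_t overheadCycles() {
    return nsToCycles(kMeasuredNs, kClockMHz) - loopCycles(kNumSamples, kCyclesPerSample);
}

static_assert(loopCycles(kNumSamples, kCyclesPerSample) == 448, "32 * 14 cycles");
static_assert(overheadCycles() == 13, "461 - 448 cycles");
```

With these figures, the per-sample work accounts for 448 of the 461 cycles, so reducing the cost per sample is where most of the gain is to be found.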
Let's try to understand a bit further the execution of the kernel by running the debugger.
To run the debugger, click on Debug under AIE SIMULATOR / HARDWARE in the flow navigator.
A new view will open. First, we want to step through the for loop of the kernel code, so we add a breakpoint to stop at the entry of the loop. To do so, open the kernel.cc source file and click in the left margin (to the left of the line numbers) on line 11 to add a breakpoint.
Then, to understand the execution of the code in terms of clock cycles, it helps to view the assembly code resulting from the kernel compilation. For this, click on View > Disassembly View.
Finally, for the debugger to step through the code one instruction at a time, we need to set Step by instruction in the debug options.
We can see that the 2 cores are paused on a breakpoint (this is the breakpoint at the main entry, not the one we added in the kernel code).
For the moment, we only need to run core [6, 0], as this is the tile that runs the kernel first (the kernel second has to wait for kernel first to complete before it gets its input data anyway). Select core [6, 0] and click on the continue icon.
We can see in the kernel.cc code that the core is now stopped inside the for loop.
We can see in the Disassembly view that we are inside the kernel simple. We do not really need to understand the kernel disassembly code, but it is good to know that each line is one line of Very Long Instruction Word (VLIW) code. You can see that each line has multiple columns (up to 6 on AIE-ML). Each column is a slot of the VLIW instruction, and each slot holds a specific instruction, which means that the AIE-ML can execute up to 6 operations in parallel.
One thing we can note is that we see a lot of NOPX, NOPA, NOPB, NOPS or NOPM instructions. These are No Operation instructions, which means that we are not doing anything in those slots. So for most of our code we are not really using much of the processor. This is because the main part of the AIE-ML is the vector processor, and the current code does not exercise it because it is not vectorized (this is what we will improve in the next section).
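To make the idea of slot utilization concrete, here is a minimal C++ sketch that counts how many slots of one instruction line are doing useful work. It is an illustration only: the example line is made up, "OP" is a placeholder for any real instruction, and the helper names (isNop, usedSlots) are hypothetical. The only facts taken from the text are the NOP mnemonics and the limit of 6 slots per line.

```cpp
#include <cstddef>

// True if the mnemonic starts with "NOP" (NOPX, NOPA, NOPB, NOPS, NOPM, ...).
constexpr bool isNop(const char* mnemonic) {
    return mnemonic[0] == 'N' && mnemonic[1] == 'O' && mnemonic[2] == 'P';
}

// Count the slots of one VLIW instruction line that are not NOPs.
template <std::size_t N>
constexpr std::size_t usedSlots(const char* const (&slots)[N]) {
    std::size_t used = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (!isNop(slots[i])) {
            ++used;
        }
    }
    return used;
}

// Made-up instruction line: five idle slots and one placeholder operation.
constexpr const char* kExampleLine[] = {"NOPA", "NOPB", "NOPS", "NOPX", "NOPM", "OP"};

static_assert(usedSlots(kExampleLine) == 1, "only one of the six slots does work");
```

A line like this one uses a single slot out of six, which is the kind of pattern that dominates the disassembly of the scalar kernel.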
Now let's step through the code to understand the time it takes to execute. Click on the Step Into icon or press F11 and count how many cycles it takes to execute line 11 (which is the input sample load).
Counting the cycles for each operation, we get approximately the following:
- 7 instructions/cycles for the load
- 2 instructions/cycles for the Add operation
- 4 instructions/cycles for the Sub operation
- 1 instruction/cycle for the store operation
Summing these gives 7 + 2 + 4 + 1 = 14 clock cycles per output sample, which matches what we observe on the traces.
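The per-operation counts listed above can be tallied with a short C++ sketch to see where the time goes. The OpCost structure and the helper names are hypothetical; the cycle counts are the approximate values read from the debugger in this tutorial.

```cpp
#include <cstddef>
#include <cstdint>

// One operation of the loop body and its approximate cost in cycles.
struct OpCost {
    const char*   name;
    std::uint32_t cycles;
};

// Approximate counts observed while stepping through the loop.
constexpr OpCost kPerSample[] = {
    {"load", 7}, {"add", 2}, {"sub", 4}, {"store", 1},
};

// Total cycles needed to produce one output sample.
constexpr std::uint32_t cyclesPerSample() {
    std::uint32_t total = 0;
    for (const OpCost& op : kPerSample) {
        total += op.cycles;
    }
    return total;
}

// Share of the per-sample time taken by one operation, in percent (rounded down).
constexpr std::uint32_t sharePercent(std::size_t index) {
    return kPerSample[index].cycles * 100 / cyclesPerSample();
}

static_assert(cyclesPerSample() == 14, "7 + 2 + 4 + 1");
static_assert(sharePercent(0) == 50, "the load takes half of the time");
```

In this breakdown, the load alone accounts for half of the 14 cycles, even though it is a single line of source code.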
It is important to note that it is not the instruction itself that takes the observed number of cycles; it is the instruction in the context of the code. Depending on the resources used, the AIE-ML pipeline might not be used efficiently. This is usually the case when we see many NOP instructions, which are there to wait for a resource to become available.
Summary
In this article, we have seen how to analyze the performance of an AI Engine kernel using the AI Engine simulator in debug mode in AMD Vitis.
In the next article, we will see how to improve the performance of our kernel by vectorizing the code.
Disclaimers
- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.