Published October 27, 2025 © MIT

11 AI Engine Kernel Code Performance Analysis

In this tutorial we run the AI Engine simulator in debug mode to analyze the performance of a kernel in detail.


Things used in this project

Software apps and online services

AMD Vitis Unified Software Platform

Story

Introduction

In the previous tutorial, we saw how to improve the performance of an AMD AI Engine application at the graph level. However, the application is not yet fully optimized: as we will see, the kernel does not use the Very Long Instruction Word (VLIW) or Single Instruction Multiple Data (SIMD) capabilities of the AIE-ML.

In this tutorial we will analyze the current kernel code to understand how we can improve it.

Note: This tutorial was created using AMD Vitis 2025.1. Tool flow may vary in other versions of the tool.

Note 2: You can use the following project to rebuild an AMD Vitis workspace to follow the steps in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML

Run make all VERSION=2 to build the workspace before any graph optimization.

Kernel Code Performance Analysis

In the previous tutorial we saw that our kernel takes 461 ns to complete its execution. As the AIE-ML runs at 1 GHz on the -1L speed grade, this equals 461 clock cycles.

Our kernel is taking 461 ns to complete its execution

If we zoom in on the output of kernel first in the traces, we can also see that the output samples are spaced by 14 ns, or 14 AI Engine clock cycles.

The output samples are spaced by 14 ns

If you remember the kernel code from the previous tutorial, the kernel processes the samples one by one in a for loop:

void simple(adf::input_buffer<cint16> & in, adf::output_buffer<cint16> & out) {
  [...]
  for (unsigned i=0; i<NUM_SAMPLES; i++) {
    [...]
  }
}

The number of samples processed by each iteration of the kernel is defined in include.h and is set to 32:

#define NUM_SAMPLES 32

If each sample takes 14 ns to produce, the 32 iterations of the loop take 32 x 14 = 448 ns. Adding some overhead, we get to the 461 clock cycles observed on the trace.
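The arithmetic above can be written down as a short sketch. It is plain C++ (C++17) that runs on any host, not AI Engine code, and the constant and function names are only illustrative. The numbers are the ones read from the traces and from include.h.

```cpp
#include <cstdint>

// Values taken from the text and the traces (names are illustrative only).
constexpr std::uint64_t kNumSamples      = 32;   // NUM_SAMPLES in include.h
constexpr std::uint64_t kCyclesPerSample = 14;   // spacing between output samples
constexpr std::uint64_t kMeasuredCycles  = 461;  // kernel duration seen on the trace
constexpr double        kClockGHz        = 1.0;  // AIE-ML clock on the -1L speed grade

// At a clock of f GHz, one nanosecond contains f clock cycles.
constexpr double ns_to_cycles(double ns, double clock_ghz) {
    return ns * clock_ghz;
}

// Cycles spent producing the samples inside the for loop.
constexpr std::uint64_t loop_cycles() {
    return kNumSamples * kCyclesPerSample;
}

// Cycles left over once the loop itself is accounted for.
constexpr std::uint64_t overhead_cycles() {
    return kMeasuredCycles - loop_cycles();
}

static_assert(loop_cycles() == 448);
static_assert(overhead_cycles() == 13);
```

With these numbers the loop accounts for 448 of the 461 cycles, which leaves 13 cycles of overhead outside the loop body.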

Let's try to understand a bit further the execution of the kernel by running the debugger.

To run the debugger, click on Debug under AIE SIMULATOR / HARDWARE in the flow navigator.

Running the AI Engine Debugger

A new view will open. We want to step through the for loop of the kernel code, so we first add a breakpoint to stop at the entry of the loop. To do this, open the kernel.cc source file and click in the margin to the left of the line numbers on line 11.

Adding a breakpoint to the kernel source file

Then, if we would like to understand the execution of the code in terms of clock cycles, it helps to have a view of the assembly code resulting from the kernel compilation. For this, click on View > Disassembly View.

Opening the disassembly view

Finally, for the debugger to run step by step through instructions we need to set Step by instruction in the debug options.

Set Step by Instruction

We can see that the 2 cores are paused on a breakpoint (this is the breakpoint at the main entry, not the breakpoint we added in the kernel code).

For the moment, we only need to run core [6, 0], as this is the tile that runs kernel first (kernel second has to wait for kernel first to complete before it gets its input data anyway). Select core [6, 0] and click on the continue icon.

Run core [6,0]

We can see in the kernel.cc code that the core is now stopped inside the for loop.

Kernel stopped on the breakpoint inside the kernel for loop

We can see in the Disassembly view that we are inside the kernel simple. We do not really need to understand the kernel disassembly code, but it is good to know that each line is one Very Long Instruction Word (VLIW) instruction. You can see that each line has multiple columns (up to 6 on AIE-ML). Each column is a slot of the VLIW instruction, and each slot holds its own operation. This means the AIE-ML can execute up to 6 operations in parallel.

Disassembly code

One thing we can note is that we see a lot of NOPX, NOPA, NOPB, NOPS or NOPM instructions. These are No Operation instructions, which means that we are not doing anything in those slots. So for most of our code we are not really utilizing much of the processor. This is because the main part of the AIE-ML is the vector processor, and the current code does not exercise it because it is not vectorized (this is what we will improve in the next tutorial).
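To put a number on how empty an instruction line is, we can count the NOP slots in it. The following is a minimal host-side sketch in plain C++ (C++17), not part of the Vitis tools. It assumes that a line copied from the Disassembly view lists its slots separated by ';' and that each slot starts with its mnemonic. Adjust the separator if your view prints the slots differently. The function name count_slots and the struct SlotCount are made up for this example.

```cpp
#include <cstddef>
#include <string_view>

// Result of scanning one VLIW instruction line.
struct SlotCount {
    std::size_t total = 0;  // non-empty slots found on the line
    std::size_t nops  = 0;  // slots whose mnemonic starts with "NOP"
};

// Counts the slots of one disassembly line.
// ASSUMPTION: slots are separated by ';' and each slot begins with its mnemonic.
inline SlotCount count_slots(std::string_view line) {
    SlotCount result;
    std::size_t start = 0;
    while (start <= line.size()) {
        std::size_t end = line.find(';', start);
        if (end == std::string_view::npos) {
            end = line.size();
        }
        std::string_view slot = line.substr(start, end - start);

        // Skip leading blanks; ignore slots that are empty.
        const std::size_t first = slot.find_first_not_of(" \t");
        if (first != std::string_view::npos) {
            slot.remove_prefix(first);
            ++result.total;
            if (slot.substr(0, 3) == "NOP") {
                ++result.nops;
            }
        }
        start = end + 1;
    }
    return result;
}
```

For example, count_slots("NOPA; NOPB; NOPS; NOPX; NOPM") reports 5 slots, all of them NOPs. A line where only one of 5 slots does real work would report 5 slots and 4 NOPs.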

Now let's step through the code to understand the time it takes to execute. Click on the Step Into icon or press F11 and count how many cycles it takes to execute line 11 (which is the input sample load).

Step Into the code

Counting the steps for each operation, we get approximately the following:

7 instructions/cycles for the load
2 instructions/cycles for the Add operation
4 instructions/cycles for the Sub operation
1 instruction/cycle for the store operation

Adding these up (7 + 2 + 4 + 1) gives the 14 clock cycles per output sample that we observe on the traces.

It is important to note that it is not the instruction itself that takes the observed number of cycles, but the instruction in the context of the code. Depending on the resources used, the AIE-ML pipeline might not be used efficiently. This is usually the case when we see many NOP instructions, which are there to wait for a resource to become available.
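The per-operation counts above can be summed in a few lines of plain C++ (C++17) to see where the cycles go. This is only a bookkeeping sketch that runs on any host. The counts are the approximate ones from the stepping session, and the names OpCost, kBreakdown, cycles_per_sample and share_percent are illustrative.

```cpp
#include <array>

// One line of the cycle breakdown observed while stepping.
struct OpCost {
    const char* name;
    unsigned    cycles;
};

// Approximate counts from the debugger session described above.
constexpr std::array<OpCost, 4> kBreakdown{{
    {"load", 7},
    {"add", 2},
    {"sub", 4},
    {"store", 1},
}};

// Total cycles spent on one output sample.
constexpr unsigned cycles_per_sample() {
    unsigned total = 0;
    for (const OpCost& op : kBreakdown) {
        total += op.cycles;
    }
    return total;
}

// Share of the per-sample budget used by an operation, in percent.
constexpr double share_percent(unsigned cycles) {
    return 100.0 * cycles / cycles_per_sample();
}

static_assert(cycles_per_sample() == 14);
```

With these counts the load alone accounts for half of the cycles spent on each sample (7 of 14).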

Summary

In this article, we have seen how to analyze the performance of an AI Engine kernel using the AI Engine simulator in debug mode in AMD Vitis.

In the next article we will see how we can improve the performance of our kernel by vectorizing the code.

Disclaimers

AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Credits

Florent Werbrouck

