In the previous tutorial, we saw that our kernel was not using the Very Long Instruction Word (VLIW) or Single Instruction Multiple Data (SIMD) capabilities of the AIE-ML.
In this tutorial we will use the AMD AI Engine API to vectorize our code and take advantage of the VLIW and SIMD capabilities of the AIE-ML.
Note: This tutorial was created using 2025.1. The tool flow may vary in other versions of the tools.
Note 2: You can use the following project to rebuild an AMD Vitis workspace to follow the steps in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Run make all VERSION=2 to build the workspace before any graph optimization.

Kernel Code Vectorization

To vectorize kernel code we have to use specific C++ constructs provided by AMD. There are two different types of API: AI Engine intrinsics and the AI Engine API.
The AI Engine intrinsics are a very low-level programming API. They are specific to each AI Engine version, which means that code written with intrinsics targeting AIE will not work on AIE-ML. You can find the documentation about the AIE-ML intrinsics for 2025.1 here:
https://download.amd.com/docnav/aiengine/xilinx2025_1/aiengine_ml_intrinsics/intrinsics/index.html
The AI Engine API provides a higher level of abstraction than the intrinsics. Not only will most code (as long as it does not use architecture-specific features) work across the different versions of the architecture, but the API is also templatized, which makes it much easier to use. You can find the documentation about the AI Engine API for 2025.1 here:
https://download.amd.com/docnav/aiengine/xilinx2025_1/aiengine_api/aie_api/doc/index.html
AMD recommends using the AI Engine API, so this is what we will use in this tutorial.
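To give a flavor of the API before we dive into the kernel, here is a minimal, purely illustrative sketch (not part of this tutorial's kernel): the vector type and the operation are templatized on the element type and the number of lanes, so the same call works on both AIE and AIE-ML.

#include <aie_api/aie.hpp>

// Illustrative helper only: element-wise SIMD add of two 16-lane int16 vectors.
aie::vector<int16, 16> add_sixteen_lanes(aie::vector<int16, 16> a,
                                         aie::vector<int16, 16> b) {
    return aie::add(a, b);
}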
This is the current processing the kernel is doing for each sample:
c2.real = c1.real+c1.imag;
c2.imag = c1.real-c1.imag;

I thought about treating the samples as real 16-bit integer samples, producing two output samples from two input samples:
out[i] = in[i] + in[i+1];
out[i+1] = in[i] - in[i+1];

The AIE-ML vector unit is optimized for matrix-to-matrix operations, so the way I thought about optimizing the kernel code was to reformulate it as a matrix-to-matrix multiplication.
Looking at the Matrix Multiplication section in the AI Engine API documentation, we can see which matrix sizes are supported. Looking at the different options for 16-bit on AIE-ML, I thought I would use a 4x4x4 matrix multiplication.
With the 4x4x4 matrix multiplication, I can replicate the kernel processing with the following operation:
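Since it can be hard to visualize, here is a minimal scalar reference of that operation (purely illustrative, not the kernel code). A 4x4 block of int16 data built from 8 cint16 samples (row-major: real, imag, real, imag) is multiplied by the constant coefficient matrix that we will declare later in the kernel code, and the result reproduces the real+imag / real-imag processing above:

// Scalar reference for the 4x4x4 multiplication: C = A * B, row-major.
// A holds 8 complex samples as 16 int16 values; B is the constant coefficient matrix.
void mmul_reference_4x4x4(const int16_t A[16], int16_t C[16]) {
    static const int16_t B[16] = { 1,  1, 0,  0,
                                   1, -1, 0,  0,
                                   0,  0, 1,  1,
                                   0,  0, 1, -1 };
    for (int m = 0; m < 4; ++m)           // rows of A
        for (int n = 0; n < 4; ++n) {     // columns of B
            int32_t acc = 0;
            for (int k = 0; k < 4; ++k)   // shared dimension
                acc += A[m * 4 + k] * B[k * 4 + n];
            C[m * 4 + n] = (int16_t)acc;
        }
}

For each row of A, the first two columns of B produce real+imag and real-imag for the first complex sample of that row, and the last two columns do the same for the second complex sample.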
From the AI Engine API documentation, we can see that the matrix multiplication is defined as a class template with the following signature:
aie::mmul< M_Elems, K_Elems, N_Elems, TypeA, TypeB, AccumTag >

The different parameters are:
- M_Elems: This is the number of rows in the first matrix. This will be 4 for us.
- K_Elems: This defines the number of columns in the first matrix and the number of rows in the second matrix. This will also be 4 for us.
- N_Elems: This is the number of columns in the second matrix. This will also be 4 for us.
- TypeA: This is the type of the elements in the first matrix. We will set it to int16.
- TypeB: This is the type of the elements in the second matrix. We will also set it to int16.
- AccumTag: This is the type of the elements of the accumulator that holds the results to be written to the output matrix. This is an optional parameter; we will not set it in this example.
Inside the aie::mmul class there is the mul method, which executes the matrix multiplication operation and is defined as below:
mul(const VecA & a, const VecB & b)

As you can see, this method expects two vectors, a and b. They represent the input matrices with a row-major data layout.
As these vectors contain all the samples of our matrices, they each need to contain 16 elements (4x4) of type 16-bit integer. I also need a third matrix to receive the output of the operation. It will also be a vector of 16 elements of type 16-bit integer. This is how I have declared them in my code:
aie::vector<int16, 16> data_int;
aie::vector<int16, 16> coeff_block;
aie::vector<int16, 16> data_int_o;

To read from and write to the input and output buffers, one way is to use iterators. An iterator is an object that can iterate over the data in a buffer and provide access to individual elements.
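For example (a hypothetical snippet, not taken from this kernel), a plain scalar iterator reads one sample at a time from an input buffer:

auto it = aie::begin(in);       // scalar iterator over the input buffer
cint16 first_sample = *it++;    // read one cint16 sample and advance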
I did not want to change the prototype of the function (so I am changing only the kernel code and not the graph code), so I have kept the input and output data types as 16-bit complex integers. Thus I have declared the iterators as vectors of 8 samples (cint16), but we will later treat them as vectors of 16 samples (int16).
auto data_i = aie::begin_vector<8>(in);
auto data_o = aie::begin_vector<8>(out);

Then I am creating a class object matrixMul based on aie::mmul with the matrix sizes (4x4x4) and data types (int16) that we need.
aie::mmul<4, 4, 4, int16, int16> matrixMul;

One of the matrices of the matrix multiplication will be using constant coefficients, so we can declare the array of coefficients directly in our kernel code.
alignas(aie::vector_decl_align) const int16_t coeff[16] = { 1, 1, 0, 0, 1, -1, 0, 0, 0, 0, 1, 1, 0, 0, 1, -1};

You might notice the alignas(aie::vector_decl_align) before the declaration of the array. aie::vector_decl_align is a global constant that can be used to align the array to a boundary that works for any vector size. On AIE-ML this will basically align the array to a 16-byte boundary in the buffer (this will be in the system memory).
Note: On my first try I actually did not use alignas(aie::vector_decl_align) and this made my design fail. I will share how I debugged it in the next tutorial.
Then we need to load the values into a vector of the AIE-ML so we can feed them to the vector compute unit.
coeff_block = aie::load_v<16>(coeff);

Then I am doing the processing in the for loop:
for (unsigned i=0; i<NUM_SAMPLES/8; i++) {
    data_int = aie::vector_cast<int16>(*data_i++);
    matrixMul.mul(data_int, coeff_block);
    data_int_o = matrixMul.to_vector<int16>(0);
    *data_o++ = aie::vector_cast<cint16>(data_int_o);
}

On the first line inside the loop, I am reading a vector. It is initially a vector of 8 samples of 16-bit complex data, which I am casting to a vector of 16 samples of 16-bit integers.
Then on the second line, I am calling the method aie::mmul::mul to run the matrix multiplication.
On the third line I am calling the method aie::mmul::to_vector to extract the output of the matrix multiplication into a vector.
And finally on the last line, I am casting the vector back to a vector with 8 cint16 samples and writing to the output buffer.
One thing you can note is that we are now doing only 4 iterations of the for loop instead of 32. This is because each loop iteration processes 8 16-bit complex samples. We can now see some of the Single Instruction Multiple Data (SIMD) capability of the AIE-ML.
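For comparison, here is roughly what the scalar loop from the previous tutorial looked like (a sketch; the iterator names are placeholders), processing a single complex sample per iteration, hence the 32 iterations:

for (unsigned i = 0; i < NUM_SAMPLES; i++) {
    cint16 c1 = *inIter++;   // read one complex sample
    cint16 c2;
    c2.real = c1.real + c1.imag;
    c2.imag = c1.real - c1.imag;
    *outIter++ = c2;         // write one complex sample
}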
The full final code is attached to this tutorial for reference.
Vectorized Kernel Verification

When coding a new kernel, the first thing we want to check is that it is functionally correct (and that the code compiles fine). The recommended flow for this is to run x86 compilation and simulation, as this is much faster than the AI Engine compiler/simulator.
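If you are scripting the flow outside of the Vitis IDE, this typically corresponds to running aiecompiler with --target=x86sim on the graph sources and then running x86simulator with --pkg-dir pointing at the generated Work directory (file and directory names depend on your project setup).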
Running the compilation gives a nice Compilation Complete with no errors, so at least the code seems syntactically correct:
Compilation Complete
(WARNING:0, CRITICAL-WARNING:0, ERROR:0)

Then, running the x86 simulation, we can see that the results match the golden results which were provided as part of the sources folder. So our kernel code seems functionally correct.
We can then run the AI Engine compiler so that we can run the AI Engine simulator and get an estimate of the performance of our new kernel. The kernel is now completing the processing in 67 ns. This is a great improvement compared to the 461 ns (nearly 7x) that we were seeing initially (in the previous tutorial).
We can see that the 4th execution of the kernel (which corresponds to the 4th execution of the graph) is now happening after 791 ns. This is less than the duration of 2 kernel executions before the kernel vectorization (the end of the last kernel execution in the initial code was actually happening after 4,310 ns, before any graph or kernel code improvement).
Note 2: You can use the following project to rebuild an AMD Vitis workspace to get the final version of the project after the steps mentioned in this tutorial: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Run make all VERSION=3.

Summary

In this tutorial, we have seen how to take advantage of the vector processor of the AIE-ML by vectorizing the code using the AI Engine API, getting nearly a 7x improvement in the performance of our kernel. I hope this gave you some insight into AIE-ML vectorized programming.
Note: We can reduce the number of cycles taken by the kernel even further using some pragmas. We can also see some stalls in the traces, so there are improvements that can be made at the graph level. This is something we will see in a later article.
If you want to learn more about kernel optimization, I highly recommend the following tutorial on the AMD / Xilinx repository:
Disclaimers

- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.