In my previous tutorial, I mentioned that my design was failing after my initial code update for vectorization. I thought I would share the steps I used to solve the issue.
If you want to replicate the AMD Vitis workspace to follow the steps in this tutorial, you can use the following design: https://github.com/xflorentw/AI_Engine_Basic/tree/main/01_Simple_AIE-ML
Run make all VERSION=4 to build the workspace.
Please note that I could not recreate the issue exactly as I originally faced it, since it depends on the compiler output. So if you are not seeing the issue, you got "lucky" with the compiler output ;).
This is the kernel code that I am using:
static int16_t TEST=1;
const int16_t coeff[16] = { 1, 1, 0, 0, 1, -1, 0, 0, 0, 0, 1, 1, 0, 0, 1, -1};

void simple(adf::input_buffer<cint16> & in, adf::output_buffer<cint16> & out) {
    aie::vector<int16, 16> data_int;
    aie::vector<int16, 16> coeff_block;
    aie::vector<int16, 16> data_int_o;

    auto data_i = aie::begin_vector<8>(in);
    auto data_o = aie::begin_vector<8>(out);

    aie::mmul<4, 4, 4, int16, int16> matrixMul;

    if(TEST==1) TEST++;

    coeff_block = aie::load_v<16>(coeff);

    for (unsigned i=0; i<NUM_SAMPLES/8; i++) {
        data_int = aie::vector_cast<int16>(*data_i++);
        matrixMul.mul(data_int, coeff_block);
        data_int_o = matrixMul.to_vector<int16>(0);
        *data_o++ = aie::vector_cast<cint16>(data_int_o);
    }
}

Note that the following line was not part of my initial code; it is a line I added to try to reproduce the issue voluntarily:
if(TEST==1) TEST++;

Another thing to note is that in the AI Engine compiler settings (aiecompiler.cfg) I had set xlopt=0. This setting was there to get more details in the simulation traces, but it has an impact on this issue, as we will see.
AI Engine Compilation / Simulation Results

So, I had coded my vectorized kernel close to the code I am sharing in the introduction. I have been doing AI Engine coding for quite a while now, and the code was pretty simple, so I thought it would just need to compile: if I had not made a silly mistake like forgetting a semicolon, it would work fine. Thus, I compiled the code using the AI Engine compiler. It compiled with no errors, so I thought I was done with my code update.
But then I checked the result from the output text file and I was getting unexpected results:
5 4 2 4
13 16 -2 12
21 28 -6 20
...

While the valid results I was getting before the code vectorization were as follows:
0 2 4 6
8 10 12 14
16 18 20 22
...

X86 Compilation / Simulation Results

As I was not getting functionally correct code, I decided to move back to x86 simulation so I could iterate more quickly on my kernel code to get it functionally correct.
The x86 compilation was successful (no errors or warnings), but the x86 simulation crashed with the following output:
INFO: Reading options file './Work/options/x86sim.options'.
sim.out: 2025.1/Vitis/aietools/include/aie_api/aie.hpp:508: vector<aie_dm_resource_remove_t<T>, Elems> aie::load_v(const T *) [Elems = 16U, Resource = aie_dm_resource::none, T = short]: Assertion `detail::check_vector_alignment<Elems>(ptr) && "Insufficient alignment"' failed.
Core dump backtrace:
[...]
#6 ./Work/pthread/sim.out() [0x413498]
aie::vector<aie_dm_resource_remove<short>::type, 16u> aie::load_v<16u, (aie_dm_resource)0, short>(short const*) at 2025.1/Vitis/aietools/include/aie_api/aie.hpp:511:49
(inlined by) simple(adf::io_buffer<cint16, adf::direction::in, adf::io_buffer_config<adf::extents<4294967295u>, adf::locking::sync, adf::addressing::linear, adf::margin<0u>>>&, adf::io_buffer<cint16, adf::direction::out, adf::io_buffer_config<adf::extents<4294967295u>, adf::locking::sync, adf::addressing::linear, adf::margin<0u>>>&) at AI_Engine_Basic/01_Simple_AIE-ML/workspace/aie-ml_component_simple/build/x86sim/../../src/kernels/kernels.cc:28:17
[...]
2025.1/Vitis/aietools/bin/x86simulator: line 479: 3963694 Aborted (core dumped) setsid $X86SIM_PROG
Simulation Failed returning non zero

The assertion message was quite clear: the error comes from an assertion triggered when calling the aie::load_v API, telling me that there was "Insufficient alignment".
Looking at the core dump backtrace, the call seemed to be coming from the following line (line 28) of my kernel code:
coeff_block = aie::load_v<16>(coeff);

So something was wrong when loading the coefficient values into a vector.
I simply searched for the "alignment" keyword in the AMD AIE-ML programming guide (UG1603) and found the following:
Source: https://docs.amd.com/r/en-US/ug1603-ai-engine-ml-kernel-graph/Alignment
We can see that vector load and store operations need to be aligned to a 256-bit boundary when performing a 256-bit load; otherwise the resulting data might be unexpected. That seems to match what we are seeing.
Analysing the Alignment in the AI Engine Compilation Results

The correct coding to fix this issue is mentioned later in that section of UG1603, but before fixing it I wanted to confirm the misalignment from the AI Engine compiler results.
We can see how the variables are mapped in memory in the tile output map file. Our kernel is implemented on tile (6,0), so we can see the mapping in the following file:
aie-ml_component_simple/build/hw/Work/aie/6_0/Release/6_0.map
What we can see in this file is that the coefficient values are placed (in the heap) at addresses 0x00074702 to 0x00074721. So the array is indeed not aligned to a 256-bit boundary.
And we can see that my dummy variable TEST is placed at address 0x00074700, probably "causing" the misalignment of the coeff array. So it is easy to see how some small code changes can make the code work "luckily".
One other thing we can try is changing the xlopt setting (AI Engine kernel optimization). As mentioned, I set it to 0 to have more visibility in the traces. We can change it back to 1, which is the default value.
As you can see in the description, this setting affects how the heap is handled, and our variables are stored in the heap.
After rerunning the compilation, the 6_0.map file looks as follows:
You can see that the coeff array is now aligned to a 256-bit boundary, so running the AI Engine simulation will likely give the correct result.
We can confirm that xlopt has an impact by looking at the kernel guidance report.
We can see an info message saying that the variable coeff has been automatically aligned for us.
However, I personally do not want my code to rely on this feature. It could be risky if I hand over my code to somebody else, as changing compilation parameters could change the behaviour of the kernel.
Also, my x86 simulation is still failing (because the x86 flow does not apply the xlopt optimization that aligns the array for you), so to be able to iterate quickly on my code, it is better to fix the issue in all cases.
Alignment in Memory

As mentioned, UG1603 gives the right coding to get the variables aligned in memory:
Source: https://docs.amd.com/r/en-US/ug1603-ai-engine-ml-kernel-graph/Alignment.
This is how I fixed the code to get the working kernel I shared in my previous tutorial.
Summary

In this tutorial we have seen the importance of memory alignment when working with vectors in AI Engine. But more importantly, I tried to show the value of running x86 simulation for functional verification and debug. There is also the valgrind option, which can be used for more advanced investigation; I will maybe show it in a future tutorial (if I make more coding mistakes when writing code for another tutorial... I am sure I will! I am good enough at making coding mistakes ;) ).
This tutorial was also a good way to show you the .map file, which is generated for each active AIE-ML tile and can contain some useful information.
Disclaimers

- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.