In this tutorial we continue with the AI Engine component we imported in this previous project. We analyzed its code in the previous tutorial; the next step is to compile and run it.
When looking at the flow navigator in the AMD Vitis™ Unified IDE with the AI Engine component selected, we can see that there are two options to build and simulate the component: X86 Simulation and AIE Simulator/Hardware.
AMD UG1076 explains the difference between the two modes:
- The x86 simulator is a fast functional simulator. It does not provide timing, resource, or performance information, so we use it to check that our kernels and graphs are coded correctly and produce the right functional results before moving to the AI Engine simulator.
- The AI Engine simulator is a cycle-approximate simulator that models the timing and resources of the AI Engine array, while using transaction-level SystemC models for the NoC and DDR memory. It does provide timing, resource, and performance information, and it is of course slower than the x86 simulation.
The hardware build is in the same section as the AI Engine simulator because the simulator uses the same compiler output as the hardware.
Now that we know the difference between the two options, let's run through them and see what we get.
X86 Simulator
When clicking on Build under X86 SIMULATION in the flow navigator in the Vitis IDE, we can see that the compiler runs the following command:
v++ -c --mode aie --target x86sim --config <workspace>/aie_component_simple/aiecompiler.cfg --platform <workspace>/TE0950_basic_accel_2025_1_1_0/export/TE0950_basic_accel_2025_1_1_0/TE0950_basic_accel_2025_1_1_0.xpfm --work_dir <workspace>/aie_component_simple/build/x86sim/./Work <workspace>/aie_component_simple/src/project.cpp
- v++ -c --mode aie --target x86sim indicates the compilation of an AI Engine component for the x86 functional simulator.
- --config specifies a configuration file containing options for the compilation.
- --platform specifies the target platform, in this case the custom platform we built in this previous project. Note that if you do not have a platform yet, you can target the part directly using the --part option (see the example after this list).
- --work_dir specifies the work directory to use for building the AI Engine component.
- The last parameter is the top-level file for the application.
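For reference, a platform-less compilation might look like the following. This is only a sketch: the part number shown is an illustration, so pick the one matching your device.

v++ -c --mode aie --target x86sim --part xcve2302-sfva784-1LP-e-S --work_dir ./Work src/project.cpp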
The main output file from this compilation is a libadf.a file. One other nice output is the graph report, which shows a visualization of the graph. To see it, click on Graph under Reports in the flow navigator.
We can see the representation of the graph, similar to what I tried to draw in the previous tutorial: the graph input feeds the kernel first, which feeds the kernel second, which in turn feeds the graph output.
One thing we can also see from this representation is the names of the text files associated with the graph input and output, respectively data/input.txt and data/output.txt.
This is defined in the graph code (project.h), when declaring the PLIOs:
in = input_plio::create(plio_32_bits, "data/input.txt");
out = output_plio::create(plio_32_bits, "data/output.txt");
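For context, this is roughly how those PLIOs are wired to the kernels in the graph constructor. The lines below are a sketch based on the connectivity described above, not a verbatim copy of project.h:

connect<>(in.out[0], first.in[0]);     // graph input PLIO -> kernel first
connect<>(first.out[0], second.in[0]); // kernel first -> kernel second
connect<>(second.out[0], out.in[0]);   // kernel second -> graph output PLIO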
These files are only used in simulation, allowing the AI Engine component to be tested standalone without the need for an external test bench. The input file data/input.txt, provided to the simulator, was imported with the example. Looking briefly at its content:
0 1
2 3
4 5
6 7
8 9
[...]
We can see that there are two numbers per line. This is because the PLIOs are defined as 32 bits wide while the input samples are 16-bit complex integers. Thus each line represents, in this case, a single sample (real and imaginary parts).
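As a side note, an input file in this format is easy to generate. A minimal sketch in C++ (the sample count of 32 is an assumption; adjust to your graph):

#include <cstdio>
// Minimal generator: one cint16 sample per line, real part then imaginary part
int main() {
    std::FILE* f = std::fopen("data/input.txt", "w");
    if (!f) return 1;
    for (int i = 0; i < 32; ++i)                      // 32 samples assumed
        std::fprintf(f, "%d %d\n", 2 * i, 2 * i + 1); // reproduces the 0 1 / 2 3 / ... pattern
    std::fclose(f);
    return 0;
}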
The output.txt file will be generated by the simulation tool.
Let's run the x86 simulation by clicking on Run under X86 SIMULATION in the Vitis IDE flow navigator. We can see that the tool runs the following command:
x86simulator --pkg-dir=./Work --i=../../
- --pkg-dir points to the work directory from the AI Engine compilation.
- --i indicates where the input files are located.
What we get from this simulation is an output file, output.txt, generated under x86sim/x86simulator_output.
The content of the file is very similar to the input file (apart from the change in the values). We still have two numbers per line, as the output PLIO width is also set to 32 bits and the output data type is still a 16-bit complex integer.
In the example, a reference file, data/golden.txt, is provided with the expected output. We can see that the two files match, so the code is functionally correct and we can move on to AI Engine simulation.
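If you want to check this outside the IDE, a quick diff from a terminal does the job. The paths below are assumptions based on the build layout above, relative to the component directory:

diff build/x86sim/x86simulator_output/output.txt data/golden.txt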
AI Engine Simulator
When clicking on Build under AIE SIMULATOR/HARDWARE in the flow navigator in the Vitis IDE, we can see that the compiler runs the following command:
v++ -c --mode aie --target hw --config <workspace>/aie_component_simple/aiecompiler.cfg --platform <workspace>/TE0950_basic_accel_2025_1_1_0/export/TE0950_basic_accel_2025_1_1_0/TE0950_basic_accel_2025_1_1_0.xpfm --work_dir <workspace>/aie_component_simple/build/hw/./Work <workspace>/aie_component_simple/src/project.cpp
We can see that the only difference from the x86 simulation target is that the --target option is set to hw.
Once the compilation is complete, we can see that there are more options available under REPORTS for the AIE SIMULATOR/HARDWARE build.
We can start by looking at the graph view to compare with what we were getting for the x86 simulation. The representation is very similar, we still have the two kernels connected in a chain, but buffers have now appeared:
One of the key changes compared with the x86 simulation target is that we now have the notion of the hardware. With the x86 simulation target, the kernels were just seen as functions running on the processor. Now the design is implemented targeting the actual hardware. The PLIOs come in as streams, and we have seen that the kernel inputs and outputs are set to buffer interfaces; this is why these buffers are implemented.
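As a reminder from the code analysis, a kernel using buffer interfaces has a signature along these lines. This is a sketch; the exact data type is an assumption based on the 16-bit complex samples, not copied from the project sources:

void first(adf::input_buffer<cint16>& in, adf::output_buffer<cint16>& out);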
Note the two names on each of the buffers connected to the PLIOs (buf0/buf0d and buf2/buf2d). This means that these are double buffers: the kernel can work on the ping buffer while the PLIO is still writing to or reading from the pong buffer. Thus, once the kernel completes an iteration, it does not have to wait for the next set of samples to start the next iteration.
The buffer between the two kernels has a single name (buf1), which means that this is a single buffer (no ping/pong). That means the kernels will not operate at the same time, but one after the other.
To understand why the tool implements only a single buffer here, we can change the view to Tile View at the top.
What we see is that the kernels are running on the same tile, i.e. on the same AI Engine, located in the 9th column and 1st row. Because they are on the same tile, they will run sequentially, which is why there is no need for a double buffer.
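As a side note, this behavior can also be requested explicitly rather than left to the compiler: ADF provides a single_buffer constraint for that. A sketch, to be placed in the graph constructor (using this graph's kernel names):

single_buffer(second.in[0]); // request a single (non ping/pong) buffer on this port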
The reason the tool locates them on the same tile is the following lines in the graph code, which we saw in the previous tutorial:
runtime<ratio>(first) = 0.1;
runtime<ratio>(second) = 0.1;
As we mentioned, setting the runtime ratio to 10% tells the compiler that the kernels can be combined on a single tile, as together they will only need 20% of the AI Engine compute time. And because the tool sees a synergy between the two kernels (they use the same buffer to share data), it decides to combine them.
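To check this reasoning, a quick experiment: raising the ratios so that their sum exceeds 100% should prevent the compiler from combining the kernels on one tile.

runtime<ratio>(first) = 0.6;  // 0.6 + 0.6 > 1.0: the kernels can no longer share a tile
runtime<ratio>(second) = 0.6;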
One other thing that we can see from this tile view is that all the buffers are located on a different tile. As this is a neighboring tile, the AI Engine can access it directly without any performance penalty. That said, this is a compiler choice; we could make the design more power efficient by placing these buffers on the same tile as the kernels, so that the tile on top could be clock gated. This placement can easily be done with placement constraints, as sketched below.
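A minimal sketch of such a constraint, reusing this graph's names. This is one possible approach, written as a relative location constraint in the graph constructor; treat it as an assumption to adapt rather than a drop-in line:

location<buffer>(first.in[0]) = location<kernel>(first);    // co-locate the input buffer with kernel first
location<buffer>(second.out[0]) = location<kernel>(second); // co-locate the output buffer with kernel second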
Another interesting report is the array view (REPORTS > Array).
This is very similar to the graph report, but the green arrows give us one more piece of information: which streams are used to transmit the data and how they go through the interconnects.
In some cases, this can be useful information to understand latency inside the array or to analyze bottlenecks and routing issues.
I will keep the analysis of the other reports for another tutorial, where we can make good use of them.
We can now run the simulation (click Run under AIE SIMULATOR/HARDWARE). We can see that the tool runs the following command, which takes the same options as the x86 simulation:
aiesimulator --pkg-dir=./Work '-i ../..'
Once the simulation completes, we can also find an output file under hw/aiesimulator_output/data/output.txt. Its content is slightly different from the output of the x86 simulator.
T 1510400 ps
0 2
T 1516800 ps
4 6
T 1523200 ps
8 10
T 1529600 ps
12 14
[...]
T 1708800 ps
TLAST
30 28
First, we can see that there are timestamps (lines starting with T) before the samples. As the AIE simulator is cycle-approximate, this can give us an estimation of the latency and throughput of our design. For example, we can see that the first sample is present at the output after 1.5104 µs.
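As a quick worked example for throughput: consecutive output samples are 1516800 − 1510400 = 6400 ps apart, i.e. one 32-bit sample every 6.4 ns, which corresponds to roughly 156 Msamples/s at the graph output.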
The other things we can see are the TLAST markers on lines 64, 129, 194, and 259 of the file. They indicate the end of a graph iteration (each time the output buffer is emptied). There are four of them because we are running four iterations.
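Those four iterations come from the way the graph is run. A sketch of what the typical test harness in project.cpp looks like; the graph class and instance names below are assumptions, not taken from the project sources:

#include "project.h"  // brings in the graph definition

simpleGraph mygraph;  // hypothetical name of the graph class declared in project.h

int main() {
    mygraph.init();
    mygraph.run(4);   // four graph iterations, hence the four TLAST markers
    mygraph.end();
    return 0;
}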
Summary
I hope this article gave you some good information about the two compilation and simulation options for AI Engine components.
Disclaimers
- AMD, Versal, and Vitis are trademarks or registered trademarks of Advanced Micro Devices, Inc.
- Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.