The Versal AI Edge Series offers end-to-end acceleration for AI-driven embedded systems, combining programmable logic, a processing system, and AI Engines: VLIW vector processors organized in an array and optimized for AI inference.
Versal adaptive SoCs offer two different generations of AI Engines. The initial version, the AI Engine (AIE), is optimized for DSP and communication applications, while the AI Engine-Machine Learning (AIE-ML) is a version optimized for machine learning. As a result, code designed for an AIE array (present, for example, on the VCK190 board) may not run directly on an AIE-ML array (e.g., on the VEK280) without some modification.
This tutorial is an example of how to adjust code written for AIE so that it can also run on AIE-ML engines. The chosen example design implements a bilinear interpolation kernel, a straightforward and fast method for image transformations such as rescaling and cropping, and a good compromise between computational cost and image quality.
The bilinear interpolation algorithm is an interpolation method for functions of two variables. It is used not only for image processing, but also for finite element analysis, computer graphics, and much more.
This work is entirely based on the bilinear interpolation AIE design tutorial for the VCK190 platform, available on GitHub in the official Xilinx repository. In this tutorial, that reference AIE design is adapted for AIE-ML. Indeed, the motivation of this short project is to understand how to adapt an existing AIE design to AIE-ML and what the main differences are between the two architectures.
The codes produced as a result of this project are available in my GitHub repository, in the bilinear interpolation AIE-ML design tutorial.
Some basic concepts to understand the code

Bilinear interpolation is an algorithm commonly used in image processing to enhance image quality after specific operations are performed or to achieve transformations such as rescaling and cropping. A common class of image processing operations is spatial transformations, which redefine the arrangement of pixels on the image plane. Examples are rotations, zooms, and corrections of geometric image defects, such as perspective or radial distortion caused by the lens.
From a mathematical point of view, bilinear interpolation is a method for interpolating functions of two variables using repeated linear interpolations. The algorithm calculates the value of the ‘new’ pixel as a weighted sum of the pixel values of the four nearest neighbors surrounding the calculated position.
Once the values of the four pixels nearest to the computed position have been extracted, the algorithm can be implemented using basic operations such as MAC (multiply-accumulate) and MSC (multiply-subtract). The equations are obtained by interpolating first in the x-direction:
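Using the same names as the kernel code below (p11, p21 and p12, p22 are the two horizontal pairs of neighbors, and xfrac is the fractional x-offset), the two x-interpolations are:

$$p_{xy1} = p_{11} + x_{frac}\,(p_{21} - p_{11}), \qquad p_{xy2} = p_{12} + x_{frac}\,(p_{22} - p_{12})$$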
And then in the y-direction:
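$$p_{xy} = p_{xy1} + y_{frac}\,(p_{xy2} - p_{xy1})$$

where yfrac is the fractional y-offset. Expanding the parentheses yields exactly one MAC and one MSC per interpolation step, which is how the kernel code is structured.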
AIE-ML engines can be programmed to perform this calculation efficiently, taking advantage of vectorization to process multiple pixels per operation.
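As a plain scalar C++ reference of what each vector lane computes (an illustrative sketch, not code from the tutorial):

// One bilinear sample, decomposed into the mac/msc pattern that the
// AIE-ML kernel applies to whole vectors. Illustrative only; names
// follow the tutorial's convention.
float bilinear_sample(float p11, float p21, float p12, float p22,
                      float xfrac, float yfrac)
{
    float pxy1 = (p11 + xfrac * p21) - xfrac * p11; // mac, then msc
    float pxy2 = (p12 + xfrac * p22) - xfrac * p12; // = p12 + xfrac*(p22 - p12)
    return (pxy1 + yfrac * pxy2) - yfrac * pxy1;    // final y interpolation
}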
AIE-ML Engine Input and Output Data Type

This example is available in two versions:

- The first version uses MATLAB® to generate test vectors as sequences of int32 numbers. Although the actual data is single-precision floating-point, it is difficult to express such numbers in text format. To capture full precision, the 32 bits used to represent a floating-point number (sign, exponent, mantissa) are written as equivalent int32 values (see the sketch after this list). A similar format is used for the files containing output data.
- The second version uses MATLAB® to generate test vectors as sequences of int16 numbers. This version was tried because the 32-bit floating-point vector data type is not directly supported by the AIE-ML processor and must be emulated via decomposition into multiple 16 x 16-bit multiplications. As you will see shortly, the floating-point design suffers a performance loss compared to the AIE case (don't worry, this aspect will be explored further at the end!). The int16 version is much faster, at the cost of reduced result accuracy.
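As a small illustration of the int32 format, here is a hypothetical C++ round-trip (the tutorial itself does this in MATLAB):

#include <bit>      // std::bit_cast, C++20
#include <cstdint>
#include <cstdio>

int main()
{
    float sample = 0.4375f;                         // an example pixel fraction
    int32_t bits = std::bit_cast<int32_t>(sample);  // raw IEEE-754 bit pattern
    std::printf("%d\n", bits);                      // the int32 written to the text file
    float back = std::bit_cast<float>(bits);        // the exact value the kernel reads back
    std::printf("%f\n", back);                      // no precision lost
    return 0;
}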
The following outlines the main modifications made to make the code compatible with AIE-ML compilation.
In the case of the fp32 data type, the AIE API library calls have been used. AIE API is a portable programming interface for AIE accelerators. It is implemented as a C++ header-only library that provides types and operations that get translated into efficient low-level intrinsics. It also provides higher-level abstractions such as iterators and multi-dimensional arrays.
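For instance, the vector iterators appearing in the snippets below come from this library. A minimal sketch (the port name is hypothetical):

#include <adf.h>
#include "aie_api/aie.hpp"
#include "aie_api/aie_adf.hpp"

void read_example(adf::input_buffer<int32>& in)
{
    auto pIn = aie::begin_vector<8>(in);  // iterate the buffer 8 lanes at a time
    aie::vector<int32, 8> v = *pIn++;     // one vector load; the iterator advances
    (void)v;                              // a real kernel would compute here
}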
First, instead of using the keyword auto, vectors and accumulators must be explicitly declared:
/*get data for first x interpolation*/
aie::vector<float, 8> xfrac = (*pInA++).cast_to<float>();
aie::vector<float, 8> p11 = (*pInB++).cast_to<float>();
aie::vector<float, 8> p21 = (*pInC++).cast_to<float>();

The function fpmac() is not supported on AIE-ML, as it belongs to the AIE intrinsics, which are strictly dependent on the device's hardware. Therefore, it is replaced with the AIE API calls aie::mac() and aie::msc(). These take as input the accumulator to which the result of the elementwise multiplication is added or subtracted (aie::mac(acc, a, b) computes acc + a*b, while aie::msc(acc, a, b) computes acc - a*b), plus the two vectors to be multiplied. Therefore, the code also includes some conversions to transform vectors into accumulators and vice versa.
aie::accum<accfloat, 8> p11_acc;
p11_acc.from_vector(p11);
/*compute first x interpolation*/
aie::accum<accfloat, 8> tempy1 = aie::mac(p11_acc,xfrac,p21);
aie::accum<accfloat, 8> pxy1 = aie::msc(tempy1, xfrac, p11);

In the case of the int16 data type, the modifications are very similar. However, instead of using the AIE API, the intrinsic calls mac() and msc() have been used:
/*get data for first x interpolation*/
aie::vector<int16,8> xfrac = *pInA++;
aie::vector<int16,8> p11 = *pInB++;
aie::vector<int16,8> p21 = *pInC++;
aie::accum<acc32, 8> p11_acc;
p11_acc.from_vector(p11);
/*compute first x interpolation*/
auto tempy1 = mac(p11_acc,xfrac,p21);
auto pxy1 = msc(tempy1,xfrac,p11);

Running the Example

Go to the bilinear interpolation folder in my GitHub repository and follow the instructions to download the files, see the code, and build the bilinear interpolation kernel on your own.
Running the example requires that both MATLAB and AMD Vitis™ tools are installed and configured correctly. After downloading the files from the GitHub repository, cd into the .../bilinear_interpolation/aie/ directory and use the make build process. All the instructions are provided.
The platform used in this tutorial is the VE2302, based on the xcve2302 device, but it is also possible to target the VEK280 board: just change the PLATFORM variable in the makefile.
Vitis Analyzer is an essential tool for accessing information on compilation, simulation, and implementation of AI Engine graphs. It can be used to obtain a summary of profiling data and to graphically display trace events.
It is interesting to look at the array view first, which illustrates how the design has been mapped onto the available physical resources. This representation shows the tiles used for kernel execution as well as the allocated buffers.
fp32 kernel performance
From the Vitis Analyzer tool, it is possible to retrieve information about the performance of the graph. Let's have a look at the AI Engine resource utilization:
From the "trace" view, it is possible to measure the kernel execution time:
The AI Engine simulator output contains a timestamp for each piece of output data (aiesimulator_output/data/output.txt). Considering the first and the last timestamp, it is possible to get the total elaboration time.
kernel_exe_time = 9463 ns - 5143 ns = 4320 ns
elaboration_rate = pixel_x_graph / kernel_exe_time = 256 / 4320 ns = 59.3 MP/s
total_elaboration_time = 17690.65 us
Throughput = output_data_amount / total_elaboration_time = 4194304 B / 17690.65 us = 237.09 MB/s

The total elaboration time is consistent with the per-invocation figure: 4194304 B of fp32 output corresponds to 1,048,576 pixels, i.e., 4096 graph invocations of 256 pixels each, and 4096 x 4320 ns ≈ 17.69 ms.
int16 kernel performance
The same considerations are repeated for the int16 case. The AI Engine resource utilization is the same (but note that the PLIO width in this case is 32-bit instead of 64-bit):
The "trace" view contains the kernel execution time. It is much shorter compared to the fp32 bit case, as expected:
kernel_exe_time = 1189 ns - 737 ns = 452 ns
elaboration_rate = pixel_x_graph / kernel_exe_time = 256 / 452 ns = 566.4 MP/s
total_elaboration_time = 1916.66 us
Throughput = output_data_amount / total_elaboration_time = 2097152 B / 1916.66 us = 1094.17 MB/s
Considerations
The solution using the int16 data type is much more efficient than the fp32 one. This can be explained by looking at the AIE-ML tile architecture. Indeed, the AI Engine-ML device has no floating-point hardware support: fp32 vector multiplications and accumulations are emulated on the bfloat16 vector datapath, and the AIE API likewise supports floating-point operations through bfloat16 emulation. For this reason, floating-point MACs have a latency greater than one cycle, and performance is strongly affected.
By contrast, the VCK190 board, which is equipped with first-generation AI Engines, directly supports fp32 thanks to a native single-precision floating-point vector datapath.
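A possible way to recover floating-point throughput on AIE-ML would be to feed the kernel bfloat16 data, so the native vector datapath is used directly rather than through fp32 emulation. This is a speculative variant, not part of the original tutorial; lane count and types are assumptions based on the AIE API documentation:

#include <adf.h>
#include "aie_api/aie.hpp"

// First x-interpolation step rewritten for bfloat16 vectors.
// On AIE-ML this maps onto the native bfloat16 datapath.
void x_interp_bf16(aie::vector<bfloat16, 16> xfrac,
                   aie::vector<bfloat16, 16> p11,
                   aie::vector<bfloat16, 16> p21,
                   aie::accum<accfloat, 16>& pxy1)
{
    aie::accum<accfloat, 16> p11_acc;
    p11_acc.from_vector(p11);                      // acc = p11
    auto tempy1 = aie::mac(p11_acc, xfrac, p21);   // acc = p11 + xfrac*p21
    pxy1 = aie::msc(tempy1, xfrac, p11);           // acc = p11 + xfrac*(p21 - p11)
}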
Have a look at my thesis work on the PoliTo website for a complete demo design based on the bilinear interpolation algorithm on an AMD Versal AI Edge adaptive SoC. The design is a Linux-based system running on the AI Engine-ML array, PS, and PL, which extends the bilinear interpolation kernel just presented and adds IPs in programmable logic that interact with the AI Engine-ML array.