MicroZed Chronicles: Vitis Example Application Deep Dive

A deep dive into the Vitis example application to demonstrate that the Ultra96 V2 platform was created correctly.

Last week we completed the creation of the Vitis acceleration platform for the Ultra96 V2 and created an example application to pipe-clean the process.

The example application created is a fairly simple vector addition. Once compiled, Vitis provides all of the files needed to run the application in an sd_card directory under the hardware build structure.

Within this directory, you will see the boot.bin, the kernel image, the vector addition application, and the binary container which is loaded into the programmable logic: everything we need to run the example on the hardware.

Running the application itself is very straightforward; the starting point is copying the files from the sd_card directory to an SD card and booting the Ultra96 V2.

Once the Ultra96 is booted, we need to change directory so we can access the files on the SD card to run the application.

Change directory to the path below:

/run/media/mmcblk0p1

Now that we are in the required directory, we need to export the location of the Xilinx Runtime (XRT) before we can run the example.

export XILINX_XRT=/usr

To run the example, we then just invoke the program, passing the name of the XCLBIN as the first argument (argv[1]).

./test.exe binary_container_1.xclbin
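Inside the host application, the XCLBIN name is picked up from argv[1] (argv[0] is the program name itself). A minimal sketch of that argument handling, with a hypothetical helper name:

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: returns the XCLBIN path passed on the command
// line, or an empty string when the argument is missing.
std::string xclbin_from_args(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <xclbin>\n";
        return "";
    }
    return std::string(argv[1]); // argv[0] is "./test.exe" itself
}
```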

Let's take a look in a little more detail at what is actually happening in the example.

The example is made from two source files:

  • vadd.cpp — Contains the host application and runs on the Arm Cortex A53 cores in the MPSoC.
  • krnl_vadd.cpp — Contains the kernel which is implemented within the programmable logic. This kernel is implemented using High Level Synthesis (HLS).

Host Application

It is the role of the host application to perform the configuration and life cycle management of the kernel.

The life cycle management starts with ensuring the platform and device can be found:

cl::Platform::get(&platforms);
for (size_t i = 0; (i < platforms.size()) && (found_device == false); i++) {
    cl::Platform platform = platforms[i];
    std::string platformName = platform.getInfo<CL_PLATFORM_NAME>();
    if (platformName == "Xilinx") {
        devices.clear();
        platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
        if (devices.size()) {
            device = devices[0];
            found_device = true;
            break;
        }
    }
}

Once the platform and device have been located, the next stage is to load in the XCLBIN file.

The first step in this is to create an OpenCL context for the target device and then establish a command queue. The command queue enables communication between the host and the device, allowing the host to issue commands to the OpenCL device for execution.

// Creating Context and Command Queue for selected device
cl::Context context(device);
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);

// Load xclbin
std::cout << "Loading: '" << xclbinFilename << "'\n";
std::ifstream bin_file(xclbinFilename, std::ifstream::binary);
bin_file.seekg(0, bin_file.end);
unsigned nb = bin_file.tellg();
bin_file.seekg(0, bin_file.beg);
char *buf = new char[nb];
bin_file.read(buf, nb);

// Creating Program from Binary File
cl::Program::Binaries bins;
bins.push_back({buf, nb});
devices.resize(1);
cl::Program program(context, devices, bins);

Once the program has been created from the XCLBIN, the next step is to create the kernel object and the buffers used to exchange data with it.

Interfacing to the kernel takes place using the OpenCL memory model. Directions are defined with respect to the device, e.g., buffer_a and buffer_b are read only by the kernel, while buffer_result is write only.

cl::Buffer buffer_a(context, CL_MEM_READ_ONLY, size_in_bytes);
cl::Buffer buffer_b(context, CL_MEM_READ_ONLY, size_in_bytes);
cl::Buffer buffer_result(context, CL_MEM_WRITE_ONLY, size_in_bytes);

These buffers are allocated in global memory. For those not familiar with it, the OpenCL memory model consists of the following regions:

  • Host memory — Accessible only to the host.
  • Global memory — Accessible to both the host and the kernel; this is the main medium for transferring data between host and kernel.
  • Constant global memory — Accessible to the host and kernel; however, only the host has read/write access. For kernels, this region is read only.
  • Local memory — Used by the kernel for computation and storage; not directly accessible to the host.
  • Private memory — Used by tasks within a kernel; other tasks cannot access this memory area. Again, there is no direct host access.

Once the buffers have been created, they need to be mapped so the host application can access the buffers.

int *ptr_a = (int *)q.enqueueMapBuffer(buffer_a, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes);
int *ptr_b = (int *)q.enqueueMapBuffer(buffer_b, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes);
int *ptr_result = (int *)q.enqueueMapBuffer(buffer_result, CL_TRUE, CL_MAP_READ, 0, size_in_bytes);
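With the buffers mapped, the example fills the input pointers with test data and binds the kernel arguments before launch. A host-side sketch (the fill values and argument order are assumptions based on the typical vector-add example):

```cpp
// Stand-in for writing through the pointers returned by enqueueMapBuffer;
// once mapped, they are written like ordinary host arrays.
void fill_inputs(int *ptr_a, int *ptr_b, int n) {
    for (int i = 0; i < n; ++i) {
        ptr_a[i] = 10; // assumed test pattern
        ptr_b[i] = 20;
    }
}

// In the real host code the kernel arguments are then bound, e.g.:
//   krnl_vector_add.setArg(0, buffer_a);
//   krnl_vector_add.setArg(1, buffer_b);
//   krnl_vector_add.setArg(2, buffer_result);
//   krnl_vector_add.setArg(3, DATA_SIZE);
```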

Finally, once the input data has been set, we are ready to run the kernel. To do this, we move the input data into the buffers, launch the kernel, and then move the output data back from the result buffer.

// Data will be migrated to kernel space
q.enqueueMigrateMemObjects({buffer_a,buffer_b},0/* 0 means from host*/);
//Launch the Kernel
q.enqueueTask(krnl_vector_add);
// The result of the previous kernel execution will need to be retrieved in
// order to view the results. This call will transfer the data from FPGA to
// source_results vector
q.enqueueMigrateMemObjects({buffer_result},CL_MIGRATE_MEM_OBJECT_HOST);
q.finish();

Once kernel execution has completed, the next step is to clean up the software and de-allocate the buffers.
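Before the buffers are released, the retrieved results are typically checked against a host-computed golden reference. A minimal sketch of that check, with the pointer names mirroring the mapped buffers above:

```cpp
// Compare the kernel output against a software vector addition;
// returns true only when every element matches.
bool verify_vadd(const int *a, const int *b, const int *result, int n) {
    for (int i = 0; i < n; ++i) {
        if (result[i] != a[i] + b[i]) {
            return false; // mismatch found
        }
    }
    return true;
}
```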

Kernel Implementation

Examining the kernel implementation, you will see this looks very similar to a normal High Level Synthesis design.

The first thing of interest is the interfacing. The kernel has four arguments: two input vectors, one result vector, and a final argument defining the number of elements to be processed.

void krnl_vadd(const unsigned int *in1, // Read-Only Vector 1
               const unsigned int *in2, // Read-Only Vector 2
               unsigned int *out_r,     // Output Result
               int size                 // Size in integer
) {
#pragma HLS INTERFACE m_axi port = in1 offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = in2 offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = out_r offset = slave bundle = gmem
#pragma HLS INTERFACE s_axilite port = in1 bundle = control
#pragma HLS INTERFACE s_axilite port = in2 bundle = control
#pragma HLS INTERFACE s_axilite port = out_r bundle = control
#pragma HLS INTERFACE s_axilite port = size bundle = control
#pragma HLS INTERFACE s_axilite port = return bundle = control

The input and output vector data is defined for implementation as AXI memory-mapped ports using the HLS INTERFACE pragma, while the IP block control signals and the size of the vectors are implemented within an AXI4-Lite interface.

The body of the code is fairly simple, using nested for loops to read the input data, perform the addition, and write back the results.

To optimize the kernel for performance in the programmable logic, the loops are pipelined to provide an initiation interval (II) of one, that is, one clock cycle between being able to accept new input data.

for (int i = 0; i < size; i += BUFFER_SIZE) {
#pragma HLS LOOP_TRIPCOUNT min=c_len max=c_len
    int chunk_size = BUFFER_SIZE;
    // boundary checks
    if ((i + BUFFER_SIZE) > size)
        chunk_size = size - i;
read1:
    for (int j = 0; j < chunk_size; j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
#pragma HLS PIPELINE II=1
        v1_buffer[j] = in1[i + j];
    }
    // Burst reading B, calculating C, and burst writing
    // to global memory
vadd_writeC:
    for (int j = 0; j < chunk_size; j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
#pragma HLS PIPELINE II=1
        // perform vector addition
        out_r[i + j] = v1_buffer[j] + in2[i + j];
    }
}
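The chunking and boundary check above can be modeled in plain C++ to see how a size that is not a multiple of BUFFER_SIZE is handled. A software sketch (the BUFFER_SIZE value is an assumption; the real kernel's local buffer maps to BRAM):

```cpp
const int BUFFER_SIZE = 1024; // assumed chunk size

// Software model of the kernel's chunked vector addition.
void vadd_model(const unsigned int *in1, const unsigned int *in2,
                unsigned int *out_r, int size) {
    unsigned int v1_buffer[BUFFER_SIZE]; // models the local buffer
    for (int i = 0; i < size; i += BUFFER_SIZE) {
        int chunk_size = BUFFER_SIZE;
        if ((i + BUFFER_SIZE) > size) // final, partial chunk
            chunk_size = size - i;
        for (int j = 0; j < chunk_size; j++)
            v1_buffer[j] = in1[i + j]; // burst read of vector 1
        for (int j = 0; j < chunk_size; j++)
            out_r[i + j] = v1_buffer[j] + in2[i + j]; // add and write back
    }
}
```

Note that a size of, say, 1500 produces one full chunk of 1024 elements followed by a partial chunk of 476, which is exactly what the boundary check guards.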

Now that we have walked through the platform creation and taken a look at the contents of the host and kernel source code, we should understand a little more about the Vitis flow and be able to create our own applications.

Going forward, we will look at creating Vitis acceleration applications for a range of use cases, since we have seen how simple it is!

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Access the MicroZed Chronicles Archives with over 300 articles on the FPGA / Zynq / Zynq MPSoC, updated weekly at MicroZed Chronicles.

Adam Taylor
Adam Taylor is an expert in the design and development of embedded systems and FPGAs for several end applications (Space, Defense, Automotive).