MicroZed Chronicles: HLS Advanced Image Processing IP and PYNQ

HLS image processing verification using PYNQ.

4 years ago • Robotics / Machine Learning & AI / Internet of Things / Drones

Last week, we looked at how we could create the basics of a High Level Synthesis image processing IP. This week, we are going to look at a much more complex IP block, and this one has it all: HLS, PYNQ, and image processing.

You may recall I am a big fan of using PYNQ for testing IP block performance in hardware.

PYNQ allows us to quickly and easily add minimal off-the-shelf IP around our custom IP block, so we can begin testing and exercising it from Jupyter.

The IP block we are going to create will be able to do the following:

Read in an image from DDR memory.
Output the image over AXI Stream, creating a test pattern whose image can be changed at run time.

This ability to load a test pattern using PYNQ and inject it into an AXI Stream is very useful when we want to ensure downstream processing algorithms are correct.

It is even better if we use PYNQ to capture processed images so we can see the results of the image processing algorithms.

Last week, we looked at how we could create a simple block with an AXI Stream input and an AXI Stream output.

For this example, we will need the following interfaces:

AXI Stream Out: image output
AXI Memory Mapped: access to DDR memory
AXI Lite: control interface

AXI Memory Mapped interfaces are easy to implement in our HLS designs using the memcpy() function.

memcpy() transfers data from one location to another. When we use it in an HLS design, we can make the source or destination a port that is external to the module, which lets us access external memory.

To complete the interface, we use a pragma on the external port to instantiate an AXI memory-mapped interface, which uses burst transfers.

#pragma HLS INTERFACE m_axi depth=640 port=image offset=slave

Of course, in this example we are going to be reading from DDR memory.

Along with defining the port image as an AXI memory-mapped interface, we also provide additional information that configures the implementation. This includes the depth, which is only used in C/RTL co-simulation and tells the simulation model how many elements the port will access.

In this case, I want to read in a line at a time, so I set the depth to 640, which is the number of pixels on a line.
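The depth of 640 also tells us how much DDR traffic each line generates. The short Python calculation below is a sketch, not tool output: it assumes 32-bit int pixels, a 32-bit m_axi data bus and the Vivado HLS default max_read_burst_length of 16 beats, and it ignores any extra splitting needed when a transfer crosses a 4 KB address boundary (which AXI bursts may not do). The line_traffic() helper is hypothetical.

```python
# Per-line DDR read traffic for the TPG's m_axi port (a sketch, not tool output).
# Assumptions: 32-bit int pixels, 32-bit m_axi data bus, and the Vivado HLS
# default max_read_burst_length of 16 beats; 4 KB boundary splits are ignored.
LINE_SIZE = 640          # pixels per line, matches depth=640
BYTES_PER_PIXEL = 4      # sizeof(int) on the target
MAX_BURST_BEATS = 16     # assumed default max_read_burst_length


def line_traffic(line_size=LINE_SIZE, bytes_per_pixel=BYTES_PER_PIXEL,
                 max_burst=MAX_BURST_BEATS):
    """Return (bytes read per line, minimum number of read bursts per line)."""
    beats = line_size                   # one 32-bit beat per int pixel
    bursts = -(-beats // max_burst)     # ceiling division
    return line_size * bytes_per_pixel, bursts


bytes_per_line, bursts = line_traffic()
print(bytes_per_line, bursts)
```

This is the same 640 * sizeof(int) byte count that the memcpy() in the source requests for each line.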

We also need to define where in the address space the transfer starts. There are a few options for this: we can use an external port or a register in the AXI Lite slave interface.

For this IP block, I chose to obtain the physical address from the AXI Lite slave interface, as that way we can update it easily using PYNQ.

The main body of the code is pretty simple. We read from the buffer of pixels we just copied in from DDR and output them over an AXI Stream.
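Before going to hardware, it can help to have a software golden model of what this loop should emit. The sketch below is a hypothetical Python reference (tpg_model() is not part of the original design) that mirrors the HLS loop: the same line buffer is replayed for every row, data is truncated to the 16-bit TDATA width, TUSER marks the first pixel of the frame and TLAST marks the last pixel of each line. Its output can be compared beat by beat against C simulation or a hardware capture.

```python
WIDTH = 16  # TDATA width in bits, as in the header (#define WIDTH 16)


def tpg_model(line, lines):
    """Software model of the tpg() output stream.

    Returns a list of (data, user, last) tuples, one per AXI Stream beat.
    The same line buffer is replayed for every row, data is truncated to
    WIDTH bits as the ap_uint<WIDTH> assignment does, TUSER is set only on
    the first pixel of the frame and TLAST on the final pixel of each line.
    """
    mask = (1 << WIDTH) - 1
    line_size = len(line)
    beats = []
    for y in range(lines):
        for x in range(line_size):
            beats.append((int(line[x]) & mask,
                          int(y == 0 and x == 0),
                          int(x == line_size - 1)))
    return beats


# Same stimulus as the test bench: a 0..639 ramp replayed for 512 lines
beats = tpg_model(list(range(640)), 512)
print(len(beats))  # 327680 beats, i.e. 640 x 512
```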

Header

#include "hls_video.h"
#include <ap_fixed.h>
#include <ap_axi_sdata.h>
#include <string.h>
#define MAX_WIDTH  640
#define MAX_HEIGHT 480
#define WIDTH 16
typedef hls::stream<ap_axiu<WIDTH,1,1,1> > axis;
typedef ap_axiu<WIDTH,1,1,1> VIDEO_COMP;
void tpg(axis& OUTPUT_STREAM, int lines, int *image );

Source

void tpg(axis& OUTPUT_STREAM, int lines, int* image ){
#pragma HLS INTERFACE m_axi depth=640 port=image offset=slave
#pragma HLS INTERFACE axis register both port=OUTPUT_STREAM
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=lines
#define line_size 640
VIDEO_COMP tpg_gen;
int frame[line_size];   // on-chip buffer for one line
outer_loop:for (int y = 0; y < lines; y++){
    // Burst-read one line from DDR; the same line is re-read for every row
    memcpy(frame, image, line_size*sizeof(int));
    tpg_label0:for (int x = 0; x < line_size; x++) {
        tpg_gen.data = frame[x];               // truncated to WIDTH bits
        tpg_gen.keep = -1;                     // all bytes valid
        tpg_gen.strb = -1;
        tpg_gen.id   = 0;
        tpg_gen.dest = 0;
        tpg_gen.user = (y == 0 && x == 0);     // TUSER: start of frame
        tpg_gen.last = (x == line_size - 1);   // TLAST: end of line
        OUTPUT_STREAM.write(tpg_gen);
    }
}
}

Test Bench

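#include "hls_opencv.h"
#include "tpg.h"   // the header above; file name assumed

int main (int argc, char** argv) {
IplImage* dst;
axis  dst_axi;
int data[640];

// 640 x 512, single-channel, 16-bit output image
dst = cvCreateImage(cvSize(640,512),IPL_DEPTH_16U, 1);
// Source line: a horizontal ramp 0..639
for (int x = 0; x < 640; x++){
    data[x] = x;
}

// Generate 512 lines so the stream matches the image height
tpg(dst_axi, 512, data);
AXIvideo2IplImage(dst_axi, dst);
cvSaveImage("op.bmp", dst);
cvReleaseImage(&dst);
return 0;
}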

Once the IP block has been created, we can include this in a PYNQ overlay for our target board.

To capture and view the test pattern image in PYNQ, we need to use the Color Convert and Pixel Pack IP blocks along with a VDMA in our overlay. If you want to know more about how to create a PYNQ overlay, check out my project here.

Creating an overlay in this manner allows us to verify that our HLS IP is performing as we expected in the hardware.

This means that not only do we want to load the image into the TPG using PYNQ, we also want to view the output.

Of course, later on we can use the same approach with an updated overlay to view the results of the downstream processing.

To set up the image, we will use the PYNQ Xlnk class and its ability to allocate physically contiguous memory for buffers. We can do this using the code below:

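import numpy as np
from pynq import Xlnk

memory = Xlnk()
# Physically contiguous buffer holding one 640-pixel line, 32 bits per pixel
m1 = memory.cma_array(shape=(640,), dtype=np.uint32)
# Bus address the TPG's m_axi master will read from
m1_addr = m1.physical_address
hex(m1_addr)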

This allocates contiguous memory for a one-dimensional NumPy array of 640 elements, each 32 bits wide.

Our HLS IP block needs the physical address of the allocated contiguous memory to access it, so we read it from the .physical_address attribute.

We can then configure our HLS TPG block with the physical address of the allocated memory.

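tpg = overlay.tpg_0          # TPG handle from the loaded PYNQ overlay
tpg.write(0x18, m1_addr)     # image offset register: buffer physical address
result = tpg.read(0x18)      # read back to confirm the write
hex(result)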

You can find the AXI Lite slave register offsets in the driver files that Vivado HLS generates for the IP block.
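The offsets this notebook uses (0x10 for lines, 0x18 for the image address) come from those driver files. The control register at 0x00 follows the standard Vivado HLS block-level handshake layout; the Python sketch below assumes that layout applies to this IP, so check it against the generated header for your build. It shows why 0x81 is written to start the block: bit 0 is ap_start and bit 7 is auto_restart. The helper names ctrl_word() and decode_ctrl() are hypothetical.

```python
# AXI Lite register map of the tpg block as used in this article.
# 0x10 and 0x18 are the offsets used in the notebook; the control register
# bit layout is the standard Vivado HLS one (assumed to apply to this IP).
CTRL = 0x00
LINES = 0x10
IMAGE = 0x18

AP_START = 1 << 0
AP_DONE = 1 << 1
AP_IDLE = 1 << 2
AP_READY = 1 << 3
AUTO_RESTART = 1 << 7


def ctrl_word(start=False, auto_restart=False):
    """Build the value written to the control register at CTRL."""
    word = 0
    if start:
        word |= AP_START
    if auto_restart:
        word |= AUTO_RESTART
    return word


def decode_ctrl(value):
    """Return the names of the control/status bits set in a CTRL read-back."""
    names = {AP_START: "ap_start", AP_DONE: "ap_done", AP_IDLE: "ap_idle",
             AP_READY: "ap_ready", AUTO_RESTART: "auto_restart"}
    return [name for bit, name in names.items() if value & bit]


print(hex(ctrl_word(start=True, auto_restart=True)))  # 0x81
```

On the board, decode_ctrl(tpg.read(CTRL)) gives a readable view of whether the block is idle or still running.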

With this, we are able to configure the TPG IP block and populate the NumPy array for it to read.

In this simple instance, we set a gradient across the NumPy array:

for i in range(640):
    m1[i] = 640-i
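Since the buffer returned by cma_array behaves like a NumPy array, the same gradient can be written in one vectorised assignment instead of a Python loop. The sketch below uses a plain np.zeros array as an off-board stand-in for m1; fill_gradient() is a hypothetical helper, and on the board you would pass m1 itself.

```python
import numpy as np


def fill_gradient(buf):
    """Write the descending ramp n, n-1, ..., 1 into buf in one step.

    Equivalent to: for i in range(len(buf)): buf[i] = len(buf) - i
    """
    n = len(buf)
    buf[:] = n - np.arange(n, dtype=buf.dtype)
    return buf


m1 = np.zeros(640, dtype=np.uint32)  # stand-in for the cma_array buffer
fill_gradient(m1)
print(m1[0], m1[-1])  # 640 1
```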

We can then run the TPG and check the output image using the code in our Jupyter script:

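import cv2
import numpy as np
import matplotlib.pyplot as plt

# 'lines' (frame height) and 'cam_vdma' (the overlay's VDMA, already
# configured and started) are assumed to be set up earlier in the notebook
tpg = overlay.tpg_0
tpg.write(0x00, 0x00)        # control register: clear ap_start/auto_restart
tpg.write(0x10, lines)       # number of lines per frame
tpg.write(0x00, 0x81)        # ap_start (bit 0) + auto_restart (bit 7)
frame_camera = cam_vdma.readchannel.readframe()
frame_color = cv2.cvtColor(frame_camera, cv2.COLOR_BGR2RGB)
pixels = np.array(frame_color)
plt.imshow(pixels)
plt.show()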

When I run this in Jupyter, I see the image below, which is what I expect. Now I can get on with creating the rest of the image processing chain so that I can also capture its results in PYNQ.

This blog shows just how easily we can create HLS IP, integrate it within a PYNQ overlay, and then use PYNQ to prove the IP performs as expected before moving on to verifying more complex algorithms.

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Access the MicroZed Chronicles Archives, with over 300 articles on FPGA / Zynq / Zynq MPSoC updated weekly, at MicroZed Chronicles.

Adam Taylor is an expert in the design and development of embedded systems and FPGAs for several end applications (Space, Defense, Automotive).