Published March 22, 2022

Super96s Cluster - Part 2

Creating a series of modules that let you connect, for the first time, a PYNQ acceleration distribution network across Ultra96 (U96) edge devices.

Advanced | Full instructions provided | 2 hours

Things used in this project

Hardware components

Tria Technologies Ultra96-V2

Software apps and online services

AMD Vivado Design Suite

AMD Vivado Design Suite HLx Editions

Jupyter Notebook

Story

HLS Overlay - Vivado 2020.2

-----------------------------------------------------------------------------------------------------

Schematic: https://www.avnet.com/opasdata/d120001/medias/docus/193/Ultra96-V2%20Rev1%20Schematic.pdf

Board files: https://github.com/Avnet/bdf

-----------------------------------------------------------------------------------------------------

Objective

The objective of this module is to explore, create and apply an HLS IP kernel and integrate that IP into the PYNQ framework. PYNQ is a framework that lets the user work with hardware functions on the FPGA from high-level languages such as Python. We are creating a baseline project that lets users interface with both a DMA and custom IP. Understanding the path the IP takes through this process will make the Jupyter notebook part of the project easier to follow. In other words, this project is a placeholder for more customized IP you create later. The base module is the stream multiply function below:

A * B = C
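As a point of reference, the kernel's arithmetic can be modeled in a few lines of NumPy. This is a sketch that assumes the hardware behaves like the C++ kernel shown later in this project: each 32-bit word A is multiplied by the scalar B (`multiplier`) and the product is truncated back to a signed 32-bit value C, as `ap_int<32>` does. The function name `mult_constant_ref` is hypothetical and exists only for this illustration.

```python
import numpy as np

def mult_constant_ref(a, multiplier):
    """Software reference for C = A * B with signed 32-bit wraparound.

    Hypothetical helper: models the HLS kernel, where pkt.data (ap_int<32>)
    is multiplied by the ap_int<32> multiplier and truncated to 32 bits.
    Inputs are expected to be 32-bit values, so the 64-bit product is exact.
    """
    a64 = np.asarray(a, dtype=np.int64)
    # Multiply in 64 bits, then keep only the low 32 bits of each product
    low32 = (a64 * int(multiplier)) & 0xFFFFFFFF
    # Reinterpret those 32 bits as a signed value, as ap_int<32> would
    return low32.astype(np.uint32).view(np.int32)

print(mult_constant_ref(np.arange(6), 5))  # [ 0  5 10 15 20 25]
```

Comparing hardware output against a reference like this is a cheap way to catch interface or width mistakes before scaling up to larger IP.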

The design involves two AXI interconnects: the first connects the Zynq's AXI master to the control interfaces of both the DMA and the custom IP, and the second carries the DMA's data path, memory-mapped to stream (MM2S) and stream to memory-mapped (S2MM). This is needed because the IP works on streaming data: the raw data in memory is mapped onto the stream, and the inverse happens once the data has been output. We will cover this module as we continue below.
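To make the MM2S and S2MM roles concrete, here is a toy software model of the data path: MM2S reads an array of 32-bit words from memory and emits one stream beat per word, with TLAST set on the final beat; the kernel processes one beat at a time; S2MM writes beats back to memory until it sees TLAST. This is an illustrative sketch, not the AXI DMA's actual implementation, and the `Beat` type and the `mm2s`, `kernel` and `s2mm` functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Beat:
    data: int   # 32-bit TDATA payload
    last: bool  # TLAST: marks the final beat of the transfer

def mm2s(memory: List[int]) -> Iterator[Beat]:
    """Memory-mapped to stream: one beat per 32-bit word, TLAST on the final one."""
    for i, word in enumerate(memory):
        yield Beat(data=word, last=(i == len(memory) - 1))

def kernel(stream: Iterator[Beat], multiplier: int) -> Iterator[Beat]:
    """Toy stand-in for the multiplier IP: scale TDATA, pass TLAST through."""
    for beat in stream:
        product = (beat.data * multiplier) & 0xFFFFFFFF
        # Reinterpret the low 32 bits as signed, like ap_int<32>
        if product >= 2**31:
            product -= 2**32
        yield Beat(data=product, last=beat.last)

def s2mm(stream: Iterator[Beat]) -> List[int]:
    """Stream to memory-mapped: store beats until TLAST is seen."""
    memory = []
    for beat in stream:
        memory.append(beat.data)
        if beat.last:
            break
    return memory

print(s2mm(kernel(mm2s([0, 1, 2, 3, 4, 5]), 5)))  # [0, 5, 10, 15, 20, 25]
```

The TLAST pass-through matters in the real design too: the DMA's S2MM channel relies on it to know when a receive transfer is finished.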

The PYNQ software API abstracts the driver setup nicely, which makes the process easy to use. First, you set up the send and receive channels that carry the data, then you wait for the data to enter and leave the IP. For reference, see the resizer_pl example:

resizer_pl example: https://github.com/Xilinx/PYNQ-HelloWorld/blob/master/pynq_helloworld/notebooks/edge/resizer_pl.ipynb

HW - Creating Block Diagram

At a high level, there are two inputs (the data stream and the multiplier value) and one output. The goal for the hardware is to let the data stream through the design. The configuration is a good way to learn the best practices of the UltraFast Design Methodology: board/device planning, design creation, implementation, and design closure. The diagram consolidates the items mentioned above.

The process here illustrates the functions required for the application to execute. The most challenging piece of the puzzle is inserting the custom IP while verifying that it is compatible with the design's interface protocols (AXI4-Stream for data, AXI4-Lite for control).

Using an existing design (the resizer IP) as a model, there is a clear correspondence between the architecture and the sequence of events. Refer to the PYNQ documentation to learn more.

-----------------------------------------------------------------------------------------------------

HLS Flow Guide

This project assumes you have downloaded the board files and pointed the tools to them.
Create a project (Vitis HLS / Vivado HLS).
Name the project and choose a workspace directory. Do not add design or testbench files. Choose a solution name and select the part as shown below. Select “Finish”.

Under Part Selection, browse for the corresponding board [Ultra96_v2].

Right-click Source under the project and add a new source file.

Write your code there; the example below demonstrates a C++ stream multiply function.

Synthesize the design and wait for the reports to be generated.

Navigate to the Solution menu and select "Export RTL".

-----------------------------------------------------------------------------------------------------

RTL Flow Guide

HLS is a powerful step because it lets software describe the hardware, which speeds up the design process tremendously, whether you are a software engineer or simply new to RTL design. Working directly with hardware can be difficult even for veteran engineers; there is always a route, pin, or placement that needs touching up. Writing the kernel in C++ removes some of that uncertainty. Now let's stitch the IP into the Vivado project you are about to create.

Open Vivado 2020.2 (or higher) and select "Create Project"
Name project and directory workspace
Under Part selection, browse for the corresponding board [Ultra96_v2]

Create a Block Design

Add the folder containing your exported HLS IP to the IP repositories, and press OK

Your multiply IP can now be found when you search for it

Run Block Automation & apply Board Presets
Double-click the Zynq UltraScale+ MPSoC block and modify its configuration to enable the HP ports

Insert two AXI interconnects, one on either side of the DMA and custom IP in the middle of the design

Confirm that the Processor System Reset block has been added and connected to the design

Add an AXI DMA block and verify the settings below. Note: the stream data width has to be 32 bits to match the multiply block's 32-bit AXI4-Stream interface (ap_axis<32,0,0,0>)
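Because the stream is 32 bits wide, the DMA moves the buffer's bytes in 4-byte words, and the kernel multiplies each word as a single integer. The notebook later in this project allocates `np.uint8` buffers; with small values this happens to yield the expected bytes, because no partial product carries into the neighbouring byte. The NumPy sketch below assumes a little-endian host (as on the Ultra96's Arm cores) and ignores the partial final beat of a 6-byte transfer. It shows how a larger value breaks that assumption, and why an element type matching the stream width, such as `np.int32`, may be the safer choice. The helper `simulate_word_multiply` is hypothetical.

```python
import numpy as np

def simulate_word_multiply(buf, multiplier):
    """Model a 32-bit stream kernel applied to an arbitrary NumPy buffer.

    Hypothetical helper: the buffer's raw bytes are regrouped into
    little-endian 32-bit words, each word is multiplied by a non-negative
    multiplier with 32-bit wraparound, and the bytes are reinterpreted with
    the original dtype. The buffer size must be a multiple of 4 bytes.
    """
    words = np.frombuffer(buf.tobytes(), dtype='<u4').astype(np.uint64)
    products = ((words * multiplier) & 0xFFFFFFFF).astype('<u4')
    return np.frombuffer(products.tobytes(), dtype=buf.dtype)

small = np.array([0, 1, 2, 3], dtype=np.uint8)
print(simulate_word_multiply(small, 5))     # [ 0  5 10 15]  (no carries)

large = np.array([60, 0, 0, 0], dtype=np.uint8)
print(simulate_word_multiply(large, 5))     # [44  1  0  0]  (300 spills into byte 1)

as_int32 = np.array([60, 0, 0, 0], dtype=np.int32)
print(simulate_word_multiply(as_int32, 5))  # [300   0   0   0]
```

With `np.int32` buffers, one array element maps to exactly one stream beat, so each element is multiplied independently.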

Add your new mult_constant_hls IP block!

Refer to the block diagram and make sure the connections in your block design match it exactly

Create a hierarchy: hold Ctrl and select each block, then right-click, choose Create Hierarchy, and name it hier_0

AXI interconnect 1 (left side) carries the configuration/control of the blocks | AXI interconnect 2 (right side) follows the data path

Open the Address Editor and click Assign All; this maps memory addresses to each IP in the design
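Once Assign All has run, the address map ends up in the exported hardware handoff (.hwh) file, and on the board PYNQ exposes it through `overlay.ip_dict`, a dictionary of IP blocks whose entries include `phys_addr` and `addr_range`. The sketch below turns such a dictionary into a sorted address table and flags overlapping ranges. The helper `address_table` and the example addresses are hypothetical; on the board you would pass `overlay.ip_dict` instead.

```python
def address_table(ip_dict):
    """Sort IP blocks by base address and report overlapping ranges.

    Expects a dict shaped like PYNQ's overlay.ip_dict, whose values carry
    'phys_addr' (base address) and 'addr_range' (size in bytes).
    Returns (rows, overlaps), where rows are (name, base, end) tuples.
    """
    rows = sorted(
        ((name, info['phys_addr'], info['phys_addr'] + info['addr_range'] - 1)
         for name, info in ip_dict.items()),
        key=lambda row: row[1],
    )
    overlaps = []
    max_end, max_name = -1, None
    for name, base, end in rows:
        # A block that starts inside an earlier block's range overlaps it
        if base <= max_end:
            overlaps.append((max_name, name))
        if end > max_end:
            max_end, max_name = end, name
    return rows, overlaps

# Hypothetical addresses, for illustration only; on the board use overlay.ip_dict
example = {
    'hier_0/axi_dma_0': {'phys_addr': 0xA0000000, 'addr_range': 0x10000},
    'hier_0/mult_constant_hls_0': {'phys_addr': 0xA0010000, 'addr_range': 0x10000},
}
rows, overlaps = address_table(example)
for name, base, end in rows:
    print(f"{name:30s} 0x{base:08X} - 0x{end:08X}")
print("overlaps:", overlaps)
```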

Validate the design and confirm that validation succeeds

Final Block Design

Congratulations, you have successfully created the HW portion of the project🥳
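Before moving to the board, the usual remaining Vivado steps are creating the HDL wrapper and generating the bitstream. PYNQ's `Overlay` class expects a hardware handoff file with the same base name next to the bitstream (for example `design_5_mult.hwh` beside `design_5_mult.bit`), so both files need to be copied to the board. A small pre-flight check like the one below, using the hypothetical helper `check_overlay_files`, can catch a missing .hwh before `Overlay(...)` is called.

```python
from pathlib import Path

def check_overlay_files(bit_path):
    """Return the expected .hwh path for a bitstream; raise if either file is missing."""
    bit = Path(bit_path)
    hwh = bit.with_suffix('.hwh')
    missing = [str(p) for p in (bit, hwh) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing overlay file(s): " + ", ".join(missing))
    return hwh

# Usage on the board, before loading the overlay:
# check_overlay_files('design_5_mult.bit')
# overlay = Overlay('design_5_mult.bit')
```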

-----------------------------------------------------------------------------------------------------

SW Flow Guide

Code HLS [C++]

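#include "ap_axi_sdata.h"
#include "hls_stream.h"

// 32-bit AXI4-Stream packet with no TUSER, TID or TDEST sideband signals
typedef ap_axis<32,0,0,0> pkt_t;

void mult_constant_hls(
        hls::stream< pkt_t > &din,    // input stream, fed by the DMA (MM2S)
        hls::stream< pkt_t > &dout,   // output stream, drained by the DMA (S2MM)
        ap_int<32> multiplier) {      // scalar written by software over AXI4-Lite
    #pragma HLS INTERFACE s_axilite port=multiplier
    // No start/done handshake: the kernel runs continuously as data arrives
    #pragma HLS INTERFACE ap_ctrl_none port=return
    #pragma HLS INTERFACE axis port=din
    #pragma HLS INTERFACE axis port=dout

    // Read one packet, scale its data, and write it out.
    // TKEEP/TSTRB/TLAST are copied unchanged, so the DMA still sees the end of the transfer.
    pkt_t pkt;
    din.read(pkt);
    pkt.data *= multiplier;
    dout.write(pkt);
}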

Code Jupyter Notebook [Python]

Imports

# PYNQ overlay loading, contiguous buffer allocation and the DMA driver
from pynq import Overlay, allocate
from pynq.lib.dma import DMA

# Array handling for the buffers
import numpy as np

# General utilities kept from the original notebook (timing, images, plotting)
import time
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

Load Bitstream and Overlays

# Load the bitstream (HW/HLS/Vitis); design_5_mult.hwh must sit next to the .bit file
overlay = Overlay('design_5_mult.bit')
overlay?  # IPython help: lists the IP blocks and hierarchies in the overlay

# DMA driver inside the hier_0 hierarchy created in Vivado
dma = overlay.hier_0.axi_dma_0

# Custom HLS multiplier IP
mult = overlay.hier_0.mult_constant_hls_0

Stream Multiplier Stream IP

The blocks (dma and mult) are now initialized and operational. Below, we look at the register map of one unit (mult) and then create an input buffer and an output buffer. After setting the IP's multiplier to a value X (in this case 5, so every element of the array is multiplied by 5), you can fill an array and stream it through the link you created. Think of this as an operation that is "hardcoded" in hardware.

# Inspect the register-mapped input exposed over AXI4-Lite
mult.register_map
# Output:
# RegisterMap {
#   multiplier = Register(multiplier=0)
# }

# Create input and output buffers in contiguous memory that the DMA can access
in_buffer = allocate(shape=(6,), dtype=np.uint8)
out_buffer = allocate(shape=(6,), dtype=np.uint8)

# Assign a multiplier value of 5 as an example
mult.register_map.multiplier = 5

# Fill the input buffer with a basic array
for i in range(6):
    in_buffer[i] = i

# Print the buffer
in_buffer
# Output: PynqBuffer([0, 1, 2, 3, 4, 5], dtype=uint8)

# Send the data to the multiplier kernel and capture the result;
# both wait() calls block until their transfers complete
dma.sendchannel.transfer(in_buffer)
dma.recvchannel.transfer(out_buffer)
dma.sendchannel.wait()
dma.recvchannel.wait()

# Print the new values returned by the HLS kernel
out_buffer
# Output: PynqBuffer([ 0,  5, 10, 15, 20, 25], dtype=uint8)
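The send/receive/wait sequence above can be wrapped in a reusable function that also checks the result against an element-wise software reference. This is a sketch: `stream_multiply` is a hypothetical helper, and it relies only on the PYNQ calls already used in this notebook (`register_map.multiplier`, `sendchannel.transfer`/`wait`, `recvchannel.transfer`/`wait`).

```python
import numpy as np

def stream_multiply(dma, mult, in_buffer, out_buffer, multiplier):
    """Stream in_buffer through the multiplier IP; return True if out_buffer matches."""
    # Program the scalar input through the IP's AXI4-Lite register map
    mult.register_map.multiplier = multiplier
    # Queue both transfers before waiting, so the receive channel is ready
    # to drain the output stream while the send channel is feeding the IP
    dma.sendchannel.transfer(in_buffer)
    dma.recvchannel.transfer(out_buffer)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    # Element-wise reference, cast to the output dtype (integer casts wrap)
    expected = (np.asarray(in_buffer).astype(np.int64) * multiplier).astype(out_buffer.dtype)
    return np.array_equal(out_buffer, expected)

# On the board, after the cells above:
# ok = stream_multiply(dma, mult, in_buffer, out_buffer, 5)
```

Returning a boolean instead of printing keeps the helper usable in loops that sweep several multiplier values or buffer sizes.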

Output HLS overlay

Buffer = [ 0, 1, 2, 3, 4, 5]

Sent to the DMA; IP/kernel operational

newBuffer = [0, 5, 10, 15, 20, 25]
