A few weeks ago we looked at the Xilinx Deep Neural Network Development Kit and the DNNDK framework.
In this blog, we are going to take a deep-dive look at the element at the heart of the DNNDK: the Deep Learning Processor Unit, or DPU, as it is commonly called.
Using the DPU with the DNNDK enables us to implement Convolutional Neural Networks (CNNs) in our Zynq and Zynq MPSoC solutions.
The DPU is instantiated in the programmable logic, and requires connections to both the processor and the external memory. The external memory stores both the instructions and images for classification, while the processor responds to interrupts from the DPU to synchronize operation.
From an interfacing point of view, the DPU is very simple, consisting of multiple AXI interfaces, interrupts, clocks, and resets:
- Master DPU Instruction Interface (32 bits)
- Two Master DPU Data Interfaces (128 bits)
- Slave DPU Interface (32 bits)
Aside from the slave AXI clock, the DPU core uses two clocks: the master AXI interface clock (m_axi_dpu_clk) and a clock at twice this frequency (dpu_2x_clk). To achieve timing closure, these clocks need to be phase-aligned; a single Clocking Wizard should therefore be used to generate both clocks.
To ensure maximum performance, the master AXI clock should be set to 333 MHz, the maximum clock rate for the processing system's AXI interfaces. Of course, this means the dpu_2x_clk requires clocking at 666 MHz. Ensure the matched routing option is enabled in the Clocking Wizard.
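The 2:1 clock relationship is simple enough to sanity-check with a few lines of Python. The helper below is purely illustrative (the function name is mine, not part of any tool):

```python
# Illustrative helper: given the chosen m_axi_dpu_clk frequency,
# derive the matching dpu_2x_clk frequency and its clock period.
def dpu_clock_plan(m_axi_mhz):
    """Return (m_axi_dpu_clk MHz, dpu_2x_clk MHz, 2x period in ns)."""
    dpu_2x_mhz = 2 * m_axi_mhz           # dpu_2x_clk is always double
    period_2x_ns = 1000.0 / dpu_2x_mhz   # period the fast domain must meet
    return m_axi_mhz, dpu_2x_mhz, period_2x_ns

axi, dsp, period = dpu_clock_plan(333)
print(axi, dsp, round(period, 3))   # 333 666 1.502
```

The 1.5 ns period of the 666 MHz domain is why phase alignment and matched routing matter: there is very little slack available for clock skew between the two domains.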
Once the AXI interfaces are connected, we can then assign the memory addresses. To ensure we can work with the DNNDK, we need to assign at least 16 MB of memory to the DPU.
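As a quick sketch of what that allocation looks like, 16 MB is 0x0100_0000 bytes. The base address below is hypothetical, chosen only for illustration; use whatever the Vivado Address Editor assigns in your design:

```python
# Hypothetical DPU allocation in the Address Editor; the base address
# is an example only -- your design's assignment will differ.
DPU_BASE = 0x4F00_0000
DPU_SIZE = 16 * 1024 * 1024          # the 16 MB minimum for the DNNDK
DPU_HIGH = DPU_BASE + DPU_SIZE - 1   # inclusive high address

print(hex(DPU_SIZE))   # 0x1000000
print(hex(DPU_HIGH))   # 0x4fffffff
```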
To work with the DNNDK, the first DPU interrupt must be connected to IRQ 10, which means connecting it to bit 2 of IRQ1[7:0]. We can use Concat and Constant blocks to ensure the correct interrupt is used.
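The mapping from IRQ number to port and bit follows from the processing system exposing the PL interrupts as two 8-bit ports. A small sketch (the function name is mine, for illustration only):

```python
def irq_to_port_bit(irq_number):
    """Map a PL-to-PS IRQ number onto (port, bit), where port 0 is
    IRQ0[7:0] (IRQs 0-7) and port 1 is IRQ1[7:0] (IRQs 8-15)."""
    return irq_number // 8, irq_number % 8

port, bit = irq_to_port_bit(10)
print(port, bit)   # 1 2 -> IRQ1[7:0], bit 2, as the DNNDK expects
```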
With the design now connected into the processing system, we can focus a little more on the configuration of the IP core itself.
The first thing we need to decide is the number of DPU cores we wish the DPU IP to contain; we can have between one and three cores in our solution.
The second is the actual architecture of the cores. There are eight available architectures, named Bxxx, where xxx defines the peak number of operations per clock cycle. To provide this range of peak operations, the different core architectures offer different levels of pixel, input-channel, and output-channel parallelism.
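To make that concrete, the Bxxx figure works out as 2 × pixel parallelism × input-channel parallelism × output-channel parallelism, with each multiply-accumulate counting as two operations. The parallelism values below are my reading of the DPU documentation and should be checked against your IP version:

```python
# Parallelism as (pixel, input channel, output channel) per architecture.
# These values are assumptions drawn from the DPU documentation;
# verify them against the product guide for your IP release.
ARCHITECTURES = {
    "B512":  (4, 8, 8),
    "B1152": (4, 12, 12),
    "B4096": (8, 16, 16),
}

def peak_ops_per_cycle(pixel, in_ch, out_ch):
    # Each multiply-accumulate counts as two operations.
    return 2 * pixel * in_ch * out_ch

for name, parallelism in ARCHITECTURES.items():
    print(name, peak_ops_per_cycle(*parallelism))
# B512 512
# B1152 1152
# B4096 4096
```

Note how the architecture name simply encodes the result of this product, which is why doubling the parallelism in each dimension moves you up the range so quickly.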
Selecting the DSP cascade length is, as always, a trade-off between resource utilization and timing performance. Longer cascades use fewer logic resources but make timing closure harder, while shorter cascades use more logic yet offer better timing performance. Using higher cascade lengths is therefore more useful in smaller devices where the logic resources are not available.
The final DSP option, low or high DSP usage, relates to how the DPU IP core implements DSP elements.
- Low — DSP elements are used for multiplication only
- High — DSP elements are used for multiplication and accumulation
Again, the low setting is aimed at smaller devices, which offer limited DSP resources.
The final option is whether we desire UltraRAM to be used in the DPU IP. This is not available on all devices, but when it is, it can be used in place of BRAM.
Once all this is configured as desired, we can implement the design. When I implemented the above design, the utilization was as shown below:
If you want to understand a little more about the DPU, take a look at the Technical Reference Design available freely here.
Now that we have a Vivado bitstream, we need to integrate it with the rest of the DNNDK stack and start using the solution for our CNN application.
Keep an eye on my Hackster project page; an in-depth tutorial will be appearing there soon!
See My FPGA / SoC Projects: Adam Taylor on Hackster.io
Get the Code: ATaylorCEngFIET (Adam Taylor)
Access the MicroZed Chronicles Archives with over 260 articles on the Zynq / Zynq MPSoC, updated weekly, at MicroZed Chronicles.