YOLOv11 is one of the most eagerly awaited neural networks to be made compatible with AMD-Xilinx Vitis AI and deployed on FPGA (edge AI) hardware. In this article we describe how to make YOLOv11n compatible with Vitis AI and the DPU for inference on AMD-Xilinx FPGA boards.
This article applies to AMD-Xilinx Kria and MPSoC FPGA boards such as the ZCU104 and ZCU102, and also to Versal boards such as the VCK190. We take yolov11n from this repo.
Here are the major steps performed to make it compatible:
- Replaced the SiLU/Swish activation function with the DPU-supported Hardswish layer (see the activation-swap sketch after this list).
- Replaced torch.chunk() with a convolutional layer whose weights are assigned to reproduce the chunk operation (see the chunk-as-conv sketch after this list).
- Moved the post-processing (the unsupported layers in the forward function of the model head) out of the model during evaluation/inference.
- Tried replacing torch.matmul() with a custom matmul() function, but it took too much time while the inference output stayed the same.
- Tried replacing the split function with tensor slicing, but it is not supported: nndct_strided_slice cannot be assigned to the DPU.
- Replaced the split function with a convolutional layer, but a permute operation still ran on the CPU.
- So we rearranged the operations after the split to make them supportable, i.e. after the split we avoid slicing and flatten(), do the further operations according to the shape of the tensor after the split, and replace matmul with mul.
- Tried replacing the softmax function with a custom softmax, but due to permute, pow, and element-wise div, this softmax layer still ran on the CPU.
- So we replaced softmax with Hardsigmoid, which operates element-wise rather than across a tensor dimension; we used it because it did not affect accuracy much (see the attention sketch after this list).
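Below is a minimal sketch of the activation swap, assuming the model is loaded through the ultralytics package (the checkpoint name yolo11n.pt and the helper replace_silu are illustrative, not code from the repo):

```python
import torch.nn as nn
from ultralytics import YOLO

def replace_silu(module: nn.Module) -> None:
    """Recursively swap every SiLU/Swish for the DPU-supported Hardswish."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, nn.Hardswish(inplace=True))
        else:
            replace_silu(child)

# Checkpoint name is an assumption; use your own trained YOLOv11n weights.
model = YOLO("yolo11n.pt").model
replace_silu(model)
```

Since the weights were trained with SiLU, the model has to be retrained (or at least fine-tuned) after such an activation swap.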
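The torch.chunk() replacement can be sketched as fixed-weight 1x1 convolutions, one per chunk, whose identity weights copy the corresponding channel slice so the op maps onto the DPU as an ordinary convolution (chunk_as_convs is our illustrative name):

```python
import torch
import torch.nn as nn

def chunk_as_convs(channels: int, n_chunks: int = 2) -> nn.ModuleList:
    """1x1 convs that reproduce torch.chunk(x, n_chunks, dim=1)."""
    step = channels // n_chunks
    convs = nn.ModuleList()
    for i in range(n_chunks):
        conv = nn.Conv2d(channels, step, kernel_size=1, bias=False)
        w = torch.zeros(step, channels, 1, 1)
        for j in range(step):
            w[j, i * step + j, 0, 0] = 1.0  # copy input channel i*step + j
        conv.weight = nn.Parameter(w, requires_grad=False)
        convs.append(conv)
    return convs

# y0, y1 = [conv(x) for conv in chunk_as_convs(64)]
# matches torch.chunk(x, 2, dim=1) for a 64-channel input x.
```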
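And here is a rough sketch of the attention rework, with element-wise Hardsigmoid gating in place of softmax and mul in place of matmul; this illustrates the idea under our own naming and shapes, it is not the exact YOLOv11 attention module:

```python
import torch.nn as nn

class DPUFriendlyAttention(nn.Module):
    """Attention variant where every op maps onto the DPU (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1, bias=False)
        self.k = nn.Conv2d(channels, channels, 1, bias=False)
        self.v = nn.Conv2d(channels, channels, 1, bias=False)
        self.gate = nn.Hardsigmoid()  # element-wise, unlike softmax

    def forward(self, x):
        # mul instead of matmul, Hardsigmoid instead of softmax, so the
        # whole block stays on the DPU (no permute/pow/div on the CPU).
        attn = self.gate(self.q(x) * self.k(x))
        return attn * self.v(x)
```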
After replacing matmul and softmax:
Feature map from the default attention function:
Feature map from the DPU-supported attention function:
After making all the layers supportable on the DPU:
We check the model first by inspecting it; after all layers are supportable on the DPU, the inspection reports the following:
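The inspection step can be run with the Vitis AI inspector before quantization; a minimal sketch, assuming the vai_q_pytorch (pytorch_nndct) API and a DPU target such as DPUCZDX8G_ISA1_B4096 (the target name and 640x640 input resolution are assumptions for your board):

```python
import torch
from pytorch_nndct.apis import Inspector

# Target name/fingerprint depends on your board's DPU configuration.
inspector = Inspector("DPUCZDX8G_ISA1_B4096")

# "model" is the modified YOLOv11n from the steps above.
dummy = torch.randn(1, 3, 640, 640)
inspector.inspect(model, (dummy,), device=torch.device("cpu"),
                  output_dir="inspect", image_format="png")
```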
GPU model result comparison:
Before the change in model architecture:
After the change:
In Quantization:
- When deploying the quantized model, the batch size must be 1 and one forward pass must happen through the quantized model before deployment; otherwise it throws an error:
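A minimal deployment sketch with vai_q_pytorch, assuming calibration (quant_mode="calib") has already been run and a 640x640 input resolution (both assumptions):

```python
import torch
from pytorch_nndct.apis import torch_quantizer

model.eval()  # the modified YOLOv11n

# Deployment requires batch size 1.
dummy = torch.randn(1, 3, 640, 640)

quantizer = torch_quantizer("test", model, (dummy,))
quant_model = quantizer.quant_model

# One forward pass through the quantized model before export;
# skipping this step is what triggers the error mentioned above.
_ = quant_model(dummy)

quantizer.export_xmodel(output_dir="quant_out", deploy_check=True)
```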
In Compilation:
- During compilation of YOLOv11 after quantization, the number of DPU subgraphs must be 1, which indicates that all the layers of the model can run on the DPU. But in this case there were 9 DPU subgraphs due to the Mul function associated with the Hardswish activation function:
- So we replaced Hardswish with Hardsigmoid, retrained the model, and then quantized and compiled it successfully:
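The quantized model is compiled with the vai_c_xir compiler (the arch.json path depends on the target board), e.g. `vai_c_xir -x quant_out/*_int.xmodel -a arch.json -o compiled -n yolov11n`. The DPU subgraph count can then be verified with the xir Python API; the file and directory names below are placeholders:

```python
import xir

# Count DPU subgraphs in the compiled model; the goal is exactly 1.
graph = xir.Graph.deserialize("compiled/yolov11n.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu = [s for s in subgraphs
       if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
print(f"DPU subgraphs: {len(dpu)}")
```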
Now the yolov11.xmodel can be used for FPGA board inference in Python or C++ with the VART API and framework-independent logic.
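A minimal Python VART sketch of the on-board inference flow (the file name, dtypes, and pre/post-processing details are placeholders):

```python
import numpy as np
import vart
import xir

# Load the compiled model and pick the DPU subgraph.
graph = xir.Graph.deserialize("yolov11n.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_sg = [s for s in subgraphs
          if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_sg, "run")

in_tensor = runner.get_input_tensors()[0]
out_tensors = runner.get_output_tensors()

# Real code scales the preprocessed image by the tensor's fix point
# before casting to int8; zeros are used here as a placeholder.
inp = np.zeros(tuple(in_tensor.dims), dtype=np.int8)
outs = [np.empty(tuple(t.dims), dtype=np.int8) for t in out_tensors]

job_id = runner.execute_async([inp], outs)
runner.wait(job_id)
# Post-processing (box decode, NMS) then runs on the CPU, outside the model.
```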
Here is the performance measurement test we performed on YOLOv11:
Our YOLOv11 nano is based on 2.9M parameters.
We will share the video demo of YOLOv11 soon.
You can contact us, LogicTronix, for the quantization/compilation of any neural network and for inference on edge AI FPGA platforms.
Thanks for going through this article!
Kudos to our team, Mohan Lal Shrestha and Dikesh Shakya Banda, for creating this detailed and insightful article!
For any queries and support, please write to info@logictronix.com
LogicTronix is AMD-Xilinx Partner for FPGA Design and ML Acceleration!