Introduction: The latest deep learning models combined with state-of-the-art hardware can perform human pose estimation (HPE) in real time. HPE refers to the estimation of a kinematic model of the human body from image data. The Interdisciplinary Center for Artificial Intelligence (ICAI) at the University of Applied Sciences Eastern Switzerland (OST), in collaboration with VRM Switzerland, has adapted the well-known OpenPose network for HPE and made it computationally more efficient. In this project, this network is called ICAIPose. The current prototype, which uses a multi-camera system with a high-quality but computationally intensive deep learning model, runs with sufficient performance on multiple graphics processing units (GPUs).
For the broad adoption of HPE in the therapeutic environment, a small and cost-effective system is desired. Therefore, the pose-tracking system should run on edge devices.
Aim: In this project, ICAIPose should be implemented on the FPGA edge device Kria KV260 from AMD-Xilinx. As ICAIPose was designed for GPUs, the effort required to run such a network on an FPGA and the resulting performance impact are of major interest.
Approach: The application requires a camera interface and a deep learning processing unit. To test these hardware parts, a given example project that uses them was run on the Kria board first. Then, AMD-Xilinx's Vitis AI is used to compile the ICAIPose network, with minor adjustments, for the Deep-Learning Processor Unit (DPU) on the FPGA. The included Vitis AI Runtime engine with its Python API communicates with the DPU via an embedded Linux running on the FPGA's microprocessor.
Conclusion: ICAIPose is a very large neural network, requiring more than 100 GOps to process one frame. Nevertheless, a throughput of 8 frames per second could be achieved on the KV260. The GPU-based NVIDIA Jetson Xavier NX, which costs more than twice as much as the Kria board, achieves a similar frame rate.
The successful implementation of ICAIPose on an edge device with promising performance opens the field for broad applications in the therapeutic environment.
The Vitis AI framework from AMD-Xilinx has been extensively tested and shows its strengths, but also some teething problems. For running deep neural networks on FPGAs, Vitis AI offers a good trade-off between development time and performance. It should be considered before implementing hardware-accelerated algorithms in HDL or HLS.
Prerequisites
- Linux host PC with Vitis AI installed
- Knowledge of the Vitis AI workflow
- Internet access for the KV260
The usual output of an HPE network is a set of confidence maps for given keypoints of the human pose. For single-person HPE, the maximum of each confidence map is found and the corresponding keypoint is assigned to that location.
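This keypoint extraction step can be illustrated with a short NumPy sketch (the map size and peak position are made-up values):

```python
import numpy as np

# Hypothetical 64x64 confidence map for one keypoint; in the real
# pipeline this map is produced by the HPE network.
conf_map = np.zeros((64, 64), dtype=np.float32)
conf_map[40, 17] = 0.95  # peak confidence for this keypoint

# Single-person HPE: the keypoint is assigned to the map's maximum.
row, col = np.unravel_index(np.argmax(conf_map), conf_map.shape)
print(row, col)  # 40 17
```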
The camera interface is an important part of the design. With the Kria KV260 Basic Accessory Pack, a small camera is included.
AMD-Xilinx provides example applications for the Kria™ KV260 Vision AI Starter Kit.
The block design of the smart camera application shows that the hardware platform contains everything we need for this project, including the hardware interfaces for the camera and the DPU. This example application can be used as a base design to run custom Vitis AI models with the camera.
First, all versions are carefully checked to make sure they match:
- For Vitis AI 1.4 and previous versions the board image for the KV260 is 2020.2
- This requires using the Smartcamera app, which also uses the 2020.2 board image (not the latest version).
- The Vitis AI version of the 2020.2 smartcamera platform is Vitis AI 1.3.0
Follow these instructions to install the smartcam app on the KV260 (up to and including section 5).
Connect the KV260 board to a local network using the ethernet port.
While connected via UART/JTAG, check the IP address assigned to the ethernet (eth0) port, e.g. with ifconfig. The output would look similar to this:
eth0 Link encap:Ethernet HWaddr 00:0a:35:00:22:01
inet addr:184.108.40.206 Bcast:220.127.116.11 Mask:255.255.255.0
inet6 addr: fe80::20a:35ff:fe00:2201/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:67 errors:0 dropped:0 overruns:0 frame:0
TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:9478 (9.2 KiB) TX bytes:5806 (5.6 KiB)
In this case, the IP address is the inet addr value shown above. Use this address to connect from a host PC (connected to the same network as the KV260) to the KV260 via ssh.
To run all Vitis AI examples, some further installations have to be done on the KV260. Ensure that the device is connected to the internet.
sudo dnf install packagegroup-petalinux-x11
Set the display environment variable (e.g. export DISPLAY=:0.0)
sudo dnf install packagegroup-petalinux-vitisai
sudo dnf install packagegroup-petalinux-opencv
sudo dnf install xz
Vitis AI Runtime
sudo wget https://www.xilinx.com/bin/public/openDownload?filename=vitis-ai-runtime-1.3.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis-ai-runtime-1.3.0.tar.gz
sudo bash setup.sh
The Vitis AI Runtime (VART) expects the location of a DPU xclbin file in the vart.conf file, but the corresponding xclbin file is the one from the smartcam app. Set up vart.conf and the xclbin file with the following commands.
echo "firmware: /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin" | sudo tee /etc/vart.conf
sudo cp /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin /usr/lib/
sudo mv /usr/lib/kv260-smartcam.xclbin /usr/lib/dpu.xclbin
Before a Vitis AI example can be run, the corresponding smartcam application must be loaded (after every boot).
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
When the KV260-smartcam app is loaded, the camera can be tested with X11-forwarding with the following GStreamer command:
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! ximagesink
If a display is connected via HDMI (1920x1200 in this case), the camera can be tested as well. Change the width and height parameters according to your connected display.
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=1920, height=1200, format=NV12, framerate=30/1 ! kmssink driver-name=xlnx plane-id=39 sync=false fullscreen-overlay=true
Vitis AI Model Zoo
In the next project step, the camera system is tested in combination with Vitis AI. With the wide collection of pre-trained neural networks from Vitis AI Model Zoo, an example can be chosen.
Hourglass is a HPE network with following properties:
- Description: Pose Estimation Model with Hourglass
- Input size: 256x256
- Float ops: 10.2G
- Task: pose estimation
- Framework: caffe
- Prune: 'no'
- Newest version: Vitis AI 2.0
The precompiled version of the model for the KV260 has been compiled with Vitis AI 2.0 for a different DPU configuration. We use the DPU configuration B3136. Therefore, the hourglass model has to be recompiled with the Caffe workflow for the corresponding DPU and the correct Vitis AI version 1.3.0 Docker image.
The DPU fingerprint and the corresponding arch.json file can be found in the smartcam documentation.
The newly compiled model files can then be saved on the KV260.
The test application (test_video_hourglass) provided by the Vitis AI Library is used to run the model. To do so, use the prebuilt files provided in this project, or compile the test application with the cross-compilation system environment on the host PC for the KV260 (follow the Vitis AI instructions).
Download and extract the prebuilt files on the KV260.
tar -xf prebuilt.tar.xz
Go to the hourglass folder
The GStreamer string from the camera interface is used as the input device. With the following command, the program runs with two threads.
./test_video_hourglass hourglass_kv.xmodel "mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! appsink" -t 2
Hourglass runs at 30 fps. Note that the limiting factor is the camera, not the neural network.
Vitis AI
The main part of the project was to take a neural network designed for a conventional GPU implementation with TensorFlow and run it on an FPGA.
ICAIPose is a fairly large network with about 11 million learnable parameters and 103 GOps to process an image.
One notable layer in the original network is the PReLU activation function.
For use with Vitis AI, one has to check whether all layers of the neural network are supported (see the corresponding user guide).
All layers but the PReLU activation function are supported. The Parametric ReLU is very similar to the Leaky ReLU function (see the following figure), except that the leak term is a learnable parameter. Vitis AI supports Leaky ReLU with a fixed leak term of 0.1.
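The difference between the two activations can be sketched in a few lines of NumPy (illustrative only, not Vitis AI code; the alpha values here are examples):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Leaky ReLU: fixed leak term (0.1 is the value Vitis AI supports)
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    # Parametric ReLU: same formula, but alpha is learned during training
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.0])
y_leaky = leaky_relu(x)          # negative inputs are scaled by 0.1
y_prelu = prelu(x, alpha=0.25)   # here the (learned) leak term is 0.25
```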
Some challenges came with the introduction of the Leaky ReLU activation function. Unfortunately, Leaky ReLU is not supported by the TensorFlow 2 (TF2) workflow of Vitis AI 1.3 (it is supported by newer versions). Therefore, the TensorFlow 1 (TF1) workflow is used.
ICAIPose is written and trained in the TF2 (Keras) framework. A Keras model saved as an h5 file can be converted into a TF1 checkpoint and meta graph with the following code.
import os
import tensorflow as tf
# Suppress TensorFlow log output
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
loaded_model = tf.keras.models.load_model('model.h5')
print ('Keras model information:')
print (' Input names :',loaded_model.inputs)
print (' Output names:',loaded_model.outputs)
tfckpt = 'train/tfchkpt.ckpt'
tf_session = tf.keras.backend.get_session()
# write out tensorflow checkpoint & meta graph
saver = tf.compat.v1.train.Saver()
save_path = saver.save(tf_session,tfckpt)
print (' Checkpoint created :',tfckpt)
Even though Keras is supported by TF2, a Keras network saved in TF2 and then converted to TF1 cannot be processed by Vitis AI. Therefore, the network is saved in Keras from within the TF1 environment. Since the model has been previously trained in TF2 (Keras), the weights must first be exported.
# Running in TF2
from tensorflow import keras
model = keras.models.load_model("trained_model.h5")
model.save_weights("weights.h5")  # export weights for the TF1 workflow
and then imported in the TF1 environment.
# Running in TF1.15
# Description of the model (re-create the same architecture here)
model.load_weights("weights.h5")
After the conversion to the TF1 files, the normal Vitis AI flow (freezing, quantization, and compilation) can be used. A test dataset is used for calibration during quantization; save the dataset in the folder expected by the quantization script.
The code to run ICAIPose is adapted from an example by Mario Bergeron on GitHub. It uses the normal VART API for Python. The script run_ICAIPose_FPGA.py is used to run the model on the FPGA in a multi-threaded environment.
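The multi-threaded structure of such a script can be sketched as follows. Note that dummy_dpu_run is a placeholder standing in for the actual VART runner invocation; only the threading pattern is shown:

```python
import queue
import threading

def dummy_dpu_run(frame):
    # Placeholder for the DPU inference call (execute_async/wait in VART);
    # here it just tags the frame to simulate producing confidence maps.
    return ("confidence_maps", frame)

def worker(frames, results, lock):
    # Each thread pops frames from the queue until it sees the sentinel.
    while True:
        frame = frames.get()
        if frame is None:
            break
        out = dummy_dpu_run(frame)
        with lock:
            results.append(out)

n_threads = 3
frames, results, lock = queue.Queue(), [], threading.Lock()
threads = [threading.Thread(target=worker, args=(frames, results, lock))
           for _ in range(n_threads)]
for t in threads:
    t.start()
for i in range(8):        # feed 8 camera frames
    frames.put(i)
for _ in threads:         # one sentinel per worker thread
    frames.put(None)
for t in threads:
    t.join()
print(len(results))       # 8
```

Running the DPU from several threads keeps it busy while other threads handle capture and pre/post-processing, which is why the thread count is a command-line argument of the script.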
The following library must be installed to use the script:
sudo pip3 install imutils
Go to the folder prebuilt/ICAIPose and run the script.
python3 run_ICAIPose_FPGA.py <number of threads> <model_file>
python3 run_ICAIPose_FPGA.py 3 own_network_256_KV260.xmodel
Close the window by pressing 'q' on the host PC. The throughput performance is printed to the terminal.
Or with an HDMI display
python3 run_ICAIPose_FPGA_hdmi.py <number of threads> <model_file> <Display Width> <Display Height>
python3 run_ICAIPose_FPGA_hdmi.py 3 own_network_256_KV260.xmodel 1920 1200
It is recommended to run the application over HDMI; depending on the network connection, the throughput of the video stream over X11 may appear worse.
Results
It is interesting to know how fast the network runs on the FPGA, but it is also important to know whether HPE performance is lost due to quantization.
ICAIPose (256x256, 103 GOps): 8 fps
With the B3136 DPU and a clock frequency of 300 MHz, a theoretical throughput of 940 GOps/s is given. Therefore, the results are in the expected range (recall: 103 GOps for one image).
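This expectation can be checked with a quick back-of-the-envelope calculation (assuming the B3136 DPU delivers 3136 operations per clock cycle at its peak):

```python
peak_ops_per_cycle = 3136   # B3136 DPU configuration
clock_hz = 300e6            # 300 MHz DPU clock
gops_per_frame = 103        # ICAIPose workload per image

peak_gops_s = peak_ops_per_cycle * clock_hz / 1e9
theoretical_fps = peak_gops_s / gops_per_frame
print(round(peak_gops_s, 1), round(theoretical_fps, 1))  # 940.8 9.1
```

The measured 8 fps is close to this theoretical upper bound of roughly 9 fps, which is why the result is in the expected range.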
For comparison, the NVIDIA Jetson Xavier NX, which is more expensive than a KV260 and has a significantly higher theoretical throughput (21 TOps), reached the same throughput of 8 fps.
Human Pose Estimation Performance
A dataset that provides more than 2000 images and the corresponding ideal confidence maps is used to test the HPE performance.
The Mean Squared Error (MSE) of the normalized confidence maps is computed by squaring the difference at every pixel and averaging over all pixels. An example is shown in the following image: the mean of the image on the left is the MSE for the given input.
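As an illustration, the MSE of two toy 4x4 confidence maps can be computed as follows (all values here are made up):

```python
import numpy as np

# "Ideal" map: a single peak at the true keypoint location.
ideal = np.zeros((4, 4))
ideal[1, 2] = 1.0

# "Predicted" map: slightly lower peak plus a small spurious response.
pred = np.zeros((4, 4))
pred[1, 2] = 0.8
pred[1, 1] = 0.1

# Square the per-pixel differences, then average over all pixels.
squared_error = (ideal - pred) ** 2
mse = squared_error.mean()  # (0.2**2 + 0.1**2) / 16 ≈ 0.003125
```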
We can now compare the MSE between the quantized and the float network. As additional information, the MSE of the original network with the PReLU activation function is shown.
The MSE of all images:
Quantized INT8: 0.9332
The change from the PReLU to the Leaky ReLU activation function even slightly increased the network performance. The quantization has an impact on the MSE, but a rather small one: the quantized network performed almost as well as the unquantized float network.
The Vitis AI framework from AMD-Xilinx was extensively tested and showed its strengths and some teething troubles. Changing the target device from GPU to FPGA was possible without losing significant performance, even though the FPGA board is cheaper. Vitis AI allows one to deploy an efficient deep neural network on an FPGA without knowledge of HDL or HLS.
The Kria KV260 Vision AI Starter Kit is a great choice to start with Vitis AI. The provided camera can easily be used within the PetaLinux environment.
Acknowledgment
Thanks to the Institute of Microelectronics and Embedded Systems for supporting this challenge as part of a student project.
Revision History
- 3/14/2022 - Initial release