Introduction: The latest deep learning models combined with state-of-the-art hardware can perform human pose estimation (HPE) in real time. HPE refers to the estimation of a kinematic model of the human body from image data. The Interdisciplinary Center for Artificial Intelligence (ICAI) at the University of Applied Sciences Eastern Switzerland (OST), in collaboration with VRM Switzerland, has adapted the well-known OpenPose network for HPE and made it computationally more efficient. For this project, this network is called ICAIPose. The current prototype, which uses a multi-camera system with a first-class but computationally intensive deep learning model, runs with sufficient performance on multiple graphics processing units (GPUs).
For the broad adoption of HPE in the therapeutic environment, a small and cost-effective system is desired. Therefore, the pose-tracking system should run on edge devices.
Aim: In this project, ICAIPose should be implemented on the FPGA edge device Kria KV260 from AMD-Xilinx. As ICAIPose was designed for GPUs, the effort required to run such a network on an FPGA and the resulting performance impact are of major interest.
Approach: The application requires a camera interface and a deep learning processing unit. To test these hardware parts, a given example project that uses them was first run on the Kria board. Then, AMD-Xilinx's Vitis AI is used to compile the ICAIPose network, with minor adjustments, for the Deep-Learning Processor Unit (DPU) on the FPGA. The included Vitis AI Runtime engine with its Python API communicates with the DPU via an embedded Linux running on the FPGA's processor system.
Conclusion: ICAIPose is a very large neural network, requiring more than 100 GOps to process one frame. Nevertheless, a throughput of 8 frames per second was achieved on the KV260. The GPU-based NVIDIA Jetson Xavier NX, which costs more than twice as much as the Kria board, achieves a similar frame rate.
The successful implementation of ICAIPose on an edge device with promising performance opens the field for broad applications in the therapeutic environment.
The Vitis AI framework from AMD-Xilinx has been extensively tested and shows its strengths, but also some teething problems. For running deep neural networks on FPGAs, Vitis AI offers a good trade-off between development time and performance. It should be considered before implementing hardware-accelerated algorithms in HDL or HLS.
Prerequisites
- Linux host PC with Vitis AI installed
 - Knowledge of the Vitis AI workflow
 - Internet access for the KV260
 
The usual output of an HPE network is a set of confidence maps for the keypoints of the human pose. For single-person HPE, the maximum of each confidence map is located and the corresponding keypoint is assigned, as sketched in the example below.
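As a minimal sketch of this post-processing step (assuming the network output is available as a NumPy array of shape (H, W, K) with one confidence map per keypoint; this layout is an assumption, not taken from ICAIPose):
import numpy as np

def extract_keypoints(confidence_maps):
    # Return the (x, y) pixel position of the maximum of each confidence map.
    keypoints = []
    for k in range(confidence_maps.shape[-1]):
        cmap = confidence_maps[..., k]
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)
        keypoints.append((x, y))
    return keypoints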
The camera interface is an important part of the design. With the Kria KV260 Basic Accessory Pack, a small camera is included.
AMD-Xilinx provides example applications for the Kria™ KV260 Vision AI Starter Kit.
The block design of the smart camera application shows that the hardware platform contains everything needed for this project, including the hardware interfaces for the camera and the DPU. This example application can be used as a base design to run custom Vitis AI models with the camera.
First, all versions are carefully checked to make sure they match:
- For Vitis AI 1.4 and previous versions the board image for the KV260 is 2020.2
 - This requires using the Smartcamera app, which also uses the 2020.2 board image (not the latest version).
 - The Vitis AI version of the 2020.2 smartcamera platform is Vitis AI 1.3.0
 
Follow these instructions to install the smartcamera app on the KV260 (up to and including section 5).
Connect the KV260 board to a local network using the ethernet port.
While being connected via UART/JTAG, check the assigned IP address of the ethernet (eth0) port.
ifconfig
The output of that command would look similar to this:
eth0      Link encap:Ethernet  HWaddr 00:0a:35:00:22:01  
          inet addr:152.96.212.163  Bcast:152.96.212.255  Mask:255.255.255
          inet6 addr: fe80::20a:35ff:fe00:2201/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:9478 (9.2 KiB)  TX bytes:5806 (5.6 KiB)
          Interrupt:44
In this case, the IP address is 152.96.212.163.
Use this address to connect from a host PC (connected to the same network as the KV260) to the KV260 via ssh.
ssh petalinux@<ip-address>
To run all Vitis AI examples, some further installations have to be done on the KV260. Ensure that the device is connected to the internet.
X11-Forwarding
sudo dnf install packagegroup-petalinux-x11
Set the display environment
export DISPLAY=:0.0
Vitis AI
sudo dnf install packagegroup-petalinux-vitisai
OpenCV
sudo dnf install packagegroup-petalinux-opencv
Tar
sudo dnf install xz
Vitis AI Runtime
sudo wget https://www.xilinx.com/bin/public/openDownload?filename=vitis-ai-runtime-1.3.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis-ai-runtime-1.3.0.tar.gz
cd vitis-ai-runtime-1.3.0/aarch64/centos/
sudo bash setup.sh
The Vitis AI Runtime (VART) expects the location of a dpu.xclbin file in the vart.conf file. In this case, the corresponding xclbin file is the one from the smartcam app.
Change the vart.conf and the xclbin file with the following commands.
echo "firmware: /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin" | sudo tee /etc/vart.conf
sudo cp /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin /usr/lib/
sudo mv /usr/lib/kv260-smartcam.xclbin /usr/lib/dpu.xclbin
Before a Vitis AI example can be run, the corresponding smartcam application must be loaded (after every boot).
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
When the kv260-smartcam app is loaded, the camera can be tested via X11 forwarding with the following GStreamer command:
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! ximagesink
If a display is connected via HDMI (1920x1200 in this case), the camera can be tested as well. Please change the width and height parameters according to your connected display.
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=1920, height=1200, format=NV12, framerate=30/1 ! kmssink driver-name=xlnx plane-id=39 sync=false fullscreen-overlay=true
Vitis AI Model Zoo
In the next project step, the camera system is tested in combination with Vitis AI. With the wide collection of pre-trained neural networks from the Vitis AI Model Zoo, an example can be chosen.
Hourglass is an HPE network with the following properties:
cf_hourglass_mpii_256_256_10.2G_2.0
- Description: Pose Estimation Model with Hourglass
 - Input size: 256x256
 - Float ops: 10.2G
 - Task: pose estimation
 - Framework: caffe
 - Prune: 'no'
 - Newest version: Vitis AI 2.0
 
The precompiled version of the model for the KV260 has been compiled with Vitis AI 2.0 with the DPU configuration B4096.
We use the DPU configuration B3136. Therefore, the hourglass model has to be recompiled with the caffe workflow for the corresponding DPU and the correct Vitis AI version 1.3.0 (Docker image: xilinx/vitis-ai-cpu:1.3.411).
The DPU fingerprint and the corresponding arch.json file can be found in the smartcam documentation.
{
    "fingerprint":"0x1000020F6014406"
}
The newly compiled model files can be saved on the KV260.
The test application (test_video_hourglass) provided by the Vitis AI Library is used to run the model. To do so, use the prebuilt files provided in this project or compile the test application with the cross-compilation system environment on the host PC for the KV260 (follow the Vitis AI instructions).
Download and extract the prebuilt files on the KV260.
wget https://github.com/Nunigan/HardwareAcceleratedPoseTracking/raw/main/prebuilt.tar.xz
tar -xf prebuilt.tar.xz
Go to the hourglass folder.
cd prebuilt/hourglass/
The GStreamer string from the camera interface is used as the input device. With the following command, the program runs with two threads.
./test_video_hourglass hourglass_kv.xmodel "mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! appsink" -t 2
Hourglass runs at 30 fps. Note that the limiting factor is the camera and not the neural network.
Vitis AI
The main part of the project was to take a neural network designed for a conventional GPU implementation with Tensorflow and try to run it on an FPGA.
ICAIPose is a fairly large network with about 11 million learnable parameters and 103 GOps to process an image.
The original network is composed of the following layers:
- Conv2D
 - PReLU activation function
 - Concatenate
 - UpSampling2D
 - DepthwiseConv2D
 - MaxPooling2D
For use with Vitis AI, one has to check whether all layers of the neural network are supported by Vitis AI (see the corresponding user guide); a small helper for this check is sketched below.
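A quick way to list the distinct Keras layer types used in a model is a small helper like the following (a convenience sketch, not part of the Vitis AI tooling; it assumes the model is available as model.h5):
import tensorflow as tf

# List the layer types used in the model and compare them against the
# operators supported by the DPU (see the Vitis AI user guide).
model = tf.keras.models.load_model("model.h5")
print(sorted({type(layer).__name__ for layer in model.layers}))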
All layers but the PReLU activation function are supported. The "Parametric ReLU" is very similar to the Leaky ReLU function (see following figure) except that the leak-term is a learnable parameter. Vitis AI supports Leaky ReLU with a fixed leak-term of 0.1.
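As an illustration of this swap in Keras (the convolution layer and shapes are placeholders, not the actual ICAIPose architecture):
from tensorflow.keras import layers

inputs = layers.Input(shape=(256, 256, 3))                 # input size used in this project
features = layers.Conv2D(64, 3, padding="same")(inputs)    # illustrative layer only

# Original activation: learnable leak term (not supported by the DPU)
out_prelu = layers.PReLU()(features)

# Replacement: fixed leak term of 0.1, as supported by Vitis AI
out_leaky = layers.LeakyReLU(alpha=0.1)(features)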
Some challenges came with the introduction of the Leaky ReLU activation function. Unfortunately, Leaky ReLU is not supported in the Tensorflow 2 (TF2) workflow of Vitis AI 1.3 (it is supported in newer versions). Therefore, the Tensorflow 1 (TF1) workflow is used.
ICAIPose is written and trained in the TF2 (Keras) framework. A Keras model saved as an h5 file can be converted into a TF1 checkpoint and meta graph with the following code.
# Run with TF1.15: converts a Keras h5 model into a TF1 checkpoint and meta graph
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
tf.keras.backend.set_learning_phase(0)
loaded_model = tf.keras.models.load_model('model.h5')
print ('Keras model information:')
print (' Input names :',loaded_model.inputs)
print (' Output names:',loaded_model.outputs)
print('-------------------------------------')
tfckpt = 'train/tfchkpt.ckpt'
tf_session = tf.keras.backend.get_session()
# write out tensorflow checkpoint & meta graph
saver = tf.compat.v1.train.Saver()
save_path = saver.save(tf_session,tfckpt)
print (' Checkpoint created :',tfckpt)
Even though Keras is supported by TF1 and TF2, a Keras network saved in TF2 and then converted to TF1 cannot be processed by Vitis AI.
Therefore, the network is saved in Keras from TF1.
Since the model has been previously trained in TF2 (Keras), the weights must be exported
#Running in TF2
model = keras.models.load_model("trained_model.h5")
model.save_weights("weights.h5")
and then imported in TF1.
#Running in TF1.15
#Description of the model (redefine the ICAIPose architecture here in TF1 Keras)
##########################################
model.load_weights("weights.h5")
model.save("TF1_model.h5")
After the conversion to the TF1 files, the normal Vitis AI flow (freezing, quantization and compilation) can be used.
A test dataset is used for quantization. Save the dataset in the validation_data folder.
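For the TF1 quantizer (vai_q_tensorflow), the calibration images are typically fed in through a Python input function. A minimal sketch could look like the following; the input node name "input_1", the image size and the preprocessing are assumptions and must match the frozen ICAIPose graph:
# input_fn.py -- calibration input function for vai_q_tensorflow (a sketch)
import os
import cv2
import numpy as np

CALIB_DIR = "validation_data"
IMAGES = sorted(os.listdir(CALIB_DIR))

def calib_input(iter):
    # Load one calibration image per iteration and return it as a batch of size 1
    img = cv2.imread(os.path.join(CALIB_DIR, IMAGES[iter % len(IMAGES)]))
    img = cv2.resize(img, (256, 256)).astype(np.float32) / 255.0  # assumed preprocessing
    return {"input_1": np.expand_dims(img, axis=0)}  # key = name of the graph's input node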
The code to run ICAIPose is adapted from an example by Mario Bergeron on GitHub. It uses the normal VART API for Python. The script run_ICAIPose_FPGA.py is used to run the model on the FPGA in a multi-threaded environment.
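The core of such a VART-based script follows the usual deserialize/run pattern shown in the simplified sketch below; this is not the full multi-threaded script, and the data types and pre-/post-processing are placeholders:
# Single-frame inference with the VART Python API (simplified sketch)
import numpy as np
import xir
import vart

graph = xir.Graph.deserialize("own_network_256_KV260.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_subgraph, "run")

input_tensor = runner.get_input_tensors()[0]
output_tensor = runner.get_output_tensors()[0]
input_data = np.zeros(tuple(input_tensor.dims), dtype=np.float32)    # preprocessed frame goes here
output_data = np.zeros(tuple(output_tensor.dims), dtype=np.float32)  # confidence maps end up here

job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)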
The following library must be installed to use the script:
sudo pip3 install imutils
Go to the folder prebuilt/ICAIPose and run the script.
python3 run_ICAIPose_FPGA.py <number of threads> <model_file>
For example:
python3 run_ICAIPose_FPGA.py 3 own_network_256_KV260.xmodel
Close the window by pressing the letter 'q' on the host PC. The throughput performance is printed to the terminal.
Or with an HDMI display:
python3 run_ICAIPose_FPGA_hdmi.py <number of threads> <model_file> <Display Width> <Display Height>
For example:
python3 run_ICAIPose_FPGA_hdmi.py 3 own_network_256_KV260.xmodel 1920 1200
It is recommended to run the application over HDMI; depending on the network connection, the throughput of the X11-forwarded video stream may appear worse.
Results
It is interesting to know how fast the network runs on the FPGA, but it is also important to know whether HPE performance is lost due to the quantization.
Throughput Performance
ICAIPose (256x256, 103 GOps): 8 fps
With the B3136 DPU and a clock frequency of 300 MHz, a theoretical throughput of 940 GOps/s is given.
Therefore, the results are in the expected range: 940 GOps/s divided by the 103 GOps required for one image gives an upper bound of roughly 9 fps.
For comparison, the NVIDIA Jetson Xavier NX, which is more expensive than a KV260 and has a significantly higher theoretical throughput (21 TOps), reached the same throughput of 8 fps.
Human Pose Estimation Performance
This dataset, which provides more than 2000 images and the corresponding ideal confidence maps, is used to test the HPE performance.
The Mean Squared Error (MSE) of the normalized confidence maps is computed by squaring the difference at every pixel and averaging over the image. An example is shown in the following image. The mean of the image on the left is the MSE of the given input.
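A sketch of this metric, assuming the predicted and ideal confidence maps are available as NumPy arrays of the same shape and already normalized:
import numpy as np

def confidence_map_mse(predicted, ideal):
    # Mean squared error between predicted and ideal normalized confidence maps
    diff = predicted.astype(np.float32) - ideal.astype(np.float32)
    return np.mean(diff ** 2)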
We can now compare the MSE between the quantized and the float network. As additional information, the MSE of the original network with the PReLU activation function is shown.
The MSE of all images:
- Float: 0.8109
 - Quantized INT8: 0.9332
 - PReLU: 0.9348
The change from the PReLU to the Leaky ReLU activation function even improved the network performance a bit. The quantization has an impact on the MSE, but a rather small one. The quantized network performed as well as the unquantized PReLU network.
The Vitis AI framework from AMD-Xilinx was extensively tested and showed its strengths as well as some teething troubles. Changing the target device from GPU to FPGA was possible without losing significant performance, even though the FPGA board is cheaper. Vitis AI allows designing an efficient deep neural network for an FPGA without knowledge of HDL or HLS.
The Kria KV260 Vision AI Starter Kit is a great choice to start with Vitis AI. The provided camera can easily be used within the PetaLinux environment.
Acknowledgment
Special thanks to the ICAI and VRM Switzerland for providing a trained version of ICAIPose.
Thanks to the Institute of Microelectronics and Embedded Systems for supporting this challenge as part of a student project.
Revision History
- 3/14/2022 - Initial release
 








