Introduction: The latest deep learning models combined with state-of-the-art hardware can perform human pose estimation (HPE) in real time. HPE refers to the estimation of a kinematic model of the human body from image data. The Interdisciplinary Center for Artificial Intelligence (ICAI) at the University of Applied Sciences Eastern Switzerland (OST), in collaboration with VRM Switzerland, has adapted the well-known OpenPose network for HPE and made it computationally more efficient. For this project, this network is called ICAIPose. The current prototype, which uses a multi-camera system with a first-class but computationally intensive deep learning model, runs with sufficient performance on multiple graphics processing units (GPUs).
For the broad adoption of HPE in the therapeutic environment, a small and cost-effective system is desired. Therefore, the pose-tracking system should run on edge devices.
Aim: In this project, ICAIPose should be implemented on the FPGA edge device Kria KV260 from AMD-Xilinx. As ICAIPose was designed for GPUs, the effort required to run such a network on an FPGA and the resulting performance impact are of major interest.
Approach: The application requires a camera interface and a deep learning processing unit. To test these hardware parts, a given example project that uses them was first run on the Kria board. Then, AMD-Xilinx's Vitis AI is used to compile the ICAIPose network, with minor adjustments, for the Deep-Learning Processor Unit (DPU) on the FPGA. The included Vitis AI Runtime engine with its Python API communicates with the DPU via an embedded Linux running on the FPGA's processor system.
Conclusion: ICAIPose is a very large neural network, requiring more than 100 GOps to process one frame. Nevertheless, a throughput of 8 frames per second was achieved on the KV260. The GPU-based NVIDIA Jetson Xavier NX, which costs more than twice as much as the Kria board, achieves a similar frame rate.
The successful implementation of ICAIPose on an edge device with promising performance opens the field for broad applications in the therapeutic environment.
The Vitis AI framework from AMD-Xilinx has been extensively tested and shows its strengths, but also some teething problems. For running deep neural networks on FPGAs, Vitis AI offers a good trade-off between development time and performance. It should be considered before implementing hardware-accelerated algorithms in HDL or HLS.
Prerequisites
- Linux host PC with Vitis AI installed
 - Knowledge of the Vitis AI workflow
 - Internet access for the KV260
 
The usual output of an HPE network is a set of confidence maps for the keypoints of the human pose. For single-person HPE, the maximum of each confidence map is located and the corresponding keypoint is assigned, as sketched in the example below.
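As a minimal sketch of this post-processing step (assuming the network output is available as a NumPy array of shape (H, W, K) with one confidence map per keypoint; this layout is an assumption, not taken from ICAIPose):
import numpy as np

def extract_keypoints(confidence_maps):
    # Return the (x, y) pixel position of the maximum of each confidence map.
    keypoints = []
    for k in range(confidence_maps.shape[-1]):
        cmap = confidence_maps[..., k]
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)
        keypoints.append((x, y))
    return keypoints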
The camera interface is an important part of the design. With the Kria KV260 Basic Accessory Pack, a small camera is included.
AMD-Xilinx provides example applications for the Kria™ KV260 Vision AI Starter Kit.
The block design of the smart camera application shows that the hardware platform contains everything needed for this project, including the hardware interfaces for the camera and the DPU. This example application can be used as a base design to run custom Vitis AI models with the camera.
First, all versions are carefully checked to make sure they match:
- For Vitis AI 1.4 and previous versions the board image for the KV260 is 2020.2
 - This requires using the Smartcamera app, which also uses the 2020.2 board image (not the latest version).
 - The Vitis AI version of the 2020.2 smartcamera platform is Vitis AI 1.3.0
 
Follow these instructions to install the smartcamera app on the KV260 (up to and including section 5).
Connect the KV260 board to a local network using the ethernet port.
While being connected via UART/JTAG, check the assigned IP address of the ethernet (eth0) port.
ifconfig
The output of that command would look similar to this:
eth0      Link encap:Ethernet  HWaddr 00:0a:35:00:22:01  
          inet addr:152.96.212.163  Bcast:152.96.212.255  Mask:255.255.255
          inet6 addr: fe80::20a:35ff:fe00:2201/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:9478 (9.2 KiB)  TX bytes:5806 (5.6 KiB)
          Interrupt:44
In this case, the IP address is 152.96.212.163.
Use this address to connect from a host PC (connected to the same network as the KV260) to the KV260 via ssh.
ssh petalinux@<ip-address>
To run all Vitis AI examples, some further installations have to be done on the KV260. Ensure that the device is connected to the internet.
X11-Forwarding
sudo dnf install packagegroup-petalinux-x11
Set the display environment
export DISPLAY=:0.0
Vitis AI
sudo dnf install packagegroup-petalinux-vitisai
OpenCV
sudo dnf install packagegroup-petalinux-opencv
Tar
sudo dnf install xz
Vitis AI Runtime
sudo wget https://www.xilinx.com/bin/public/openDownload?filename=vitis-ai-runtime-1.3.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis-ai-runtime-1.3.0.tar.gz
cd vitis-ai-runtime-1.3.0/aarch64/centos/
sudo bash setup.sh
The Vitis AI Runtime (VART) expects the location of a dpu.xclbin file in the vart.conf file. In this case, the corresponding xclbin file is the one from the smartcam app.
Change the vart.conf and the xclbin file with the following commands.
echo "firmware: /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin" | sudo tee /etc/vart.conf
sudo cp /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin /usr/lib/
sudo mv /usr/lib/kv260-smartcam.xclbin /usr/lib/dpu.xclbin
Before a Vitis AI example can be run, the corresponding smartcam application must be loaded (after every boot).
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
When the kv260-smartcam app is loaded, the camera can be tested via X11 forwarding with the following GStreamer command:
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! ximagesink
If a display is connected via HDMI (1920x1200 in this case), the camera can be tested as well. Please change the width and height parameters according to your connected display.
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=1920, height=1200, format=NV12, framerate=30/1 ! kmssink driver-name=xlnx plane-id=39 sync=false fullscreen-overlay=true
Vitis AI Model Zoo
In the next project step, the camera system is tested in combination with Vitis AI. With the wide collection of pre-trained neural networks from the Vitis AI Model Zoo, an example can be chosen.
Hourglass is an HPE network with the following properties:
cf_hourglass_mpii_256_256_10.2G_2.0
- Description: Pose Estimation Model with Hourglass
 - Input size: 256x256
 - Float ops: 10.2G
 - Task: pose estimation
 - Framework: caffe
 - Prune: 'no'
 - Newest version: Vitis AI 2.0
 
The precompiled version of the model for the KV260 has been compiled with Vitis AI 2.0 with the DPU configuration B4096.
We use the DPU configuration B3136. Therefore, the hourglass model has to be recompiled with the caffe workflow for the corresponding DPU and the correct Vitis AI version 1.3.0 (Docker image: xilinx/vitis-ai-cpu:1.3.411).
The DPU fingerprint and the corresponding arch.json file can be found in the smartcam documentation.
{
    "fingerprint":"0x1000020F6014406"
}
The newly compiled model files can be saved on the KV260.
The test application (test_video_hourglass) provided by the Vitis AI Library is used to run the model. To do so, use the prebuilt files provided in this project or compile the test application with the cross-compilation system environment on the host PC for the KV260 (follow the Vitis AI instructions).
Download and extract the prebuilt files on the KV260.
wget https://github.com/Nunigan/HardwareAcceleratedPoseTracking/raw/main/prebuilt.tar.xz
tar -xf prebuilt.tar.xz
Go to the hourglass folder.
cd prebuilt/hourglass/
The GStreamer string from the camera interface is used as the input device. With the following command, the program runs with two threads.
./test_video_hourglass hourglass_kv.xmodel "mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! appsink" -t 2
Hourglass runs at 30 fps. Note that the limiting factor is the camera and not the neural network.
Vitis AI
The main part of the project was to take a neural network designed for a conventional GPU implementation with Tensorflow and try to run it on an FPGA.
ICAIPose is a fairly large network with about 11 million learnable parameters and 103 GOps to process an image.
The original network is composed of the following layers:
- Conv2D
 - PReLU activation function
 - Concatenate
 - UpSampling2D
 - DepthwiseConv2D
 - MaxPooling2D
For use with Vitis AI, one has to check whether all layers of the neural network are supported by Vitis AI (see the corresponding user guide); a small helper for this check is sketched below.
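A quick way to list the distinct Keras layer types used in a model is a small helper like the following (a convenience sketch, not part of the Vitis AI tooling; it assumes the model is available as model.h5):
import tensorflow as tf

# List the layer types used in the model and compare them against the
# operators supported by the DPU (see the Vitis AI user guide).
model = tf.keras.models.load_model("model.h5")
print(sorted({type(layer).__name__ for layer in model.layers}))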
All layers but the PReLU activation function are supported. The "Parametric ReLU" is very similar to the Leaky ReLU function (see following figure) except that the leak-term is a learnable parameter. Vitis AI supports Leaky ReLU with a fixed leak-term of 0.1.
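As an illustration of this swap in Keras (the convolution layer and shapes are placeholders, not the actual ICAIPose architecture):
from tensorflow.keras import layers

inputs = layers.Input(shape=(256, 256, 3))                 # input size used in this project
features = layers.Conv2D(64, 3, padding="same")(inputs)    # illustrative layer only

# Original activation: learnable leak term (not supported by the DPU)
out_prelu = layers.PReLU()(features)

# Replacement: fixed leak term of 0.1, as supported by Vitis AI
out_leaky = layers.LeakyReLU(alpha=0.1)(features)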
Some challenges came with the introduction of the Leaky ReLU activation function. Unfortunately, Leaky ReLU is not supported in the Tensorflow 2 (TF2) workflow of Vitis AI 1.3 (it is supported in newer versions). Therefore, the Tensorflow 1 (TF1) workflow is used.
ICAIPose is written and trained in the TF2 (Keras) framework. A Keras model saved as an h5 file can be converted into a TF1 checkpoint and meta graph with the following code.
# Run with TF1.15: converts a Keras h5 model into a TF1 checkpoint and meta graph
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
tf.keras.backend.set_learning_phase(0)
loaded_model = tf.keras.models.load_model('model.h5')
print ('Keras model information:')
print (' Input names :',loaded_model.inputs)
print (' Output names:',loaded_model.outputs)
print('-------------------------------------')
tfckpt = 'train/tfchkpt.ckpt'
tf_session = tf.keras.backend.get_session()
# write out tensorflow checkpoint & meta graph
saver = tf.compat.v1.train.Saver()
save_path = saver.save(tf_session,tfckpt)
print (' Checkpoint created :',tfckpt)
Even though Keras is supported by TF1 and TF2, a Keras network saved in TF2 and then converted to TF1 cannot be processed by Vitis AI.
Therefore, the network is saved in Keras from TF1.
Since the model has been previously trained in TF2 (Keras), the weights must be exported
#Running in TF2
model = keras.models.load_model("trained_model.h5")
model.save_weights("weights.h5")
and then imported in TF1.
#Running in TF1.15
#Description of the model (redefine the ICAIPose architecture here in TF1 Keras)
##########################################
model.load_weights("weights.h5")
model.save("TF1_model.h5")
After the conversion to the TF1 files, the normal Vitis AI flow (freezing, quantization and compilation) can be used.
A test dataset is used for quantization. Save the dataset in the validation_data folder.
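For the TF1 quantizer (vai_q_tensorflow), the calibration images are typically fed in through a Python input function. A minimal sketch could look like the following; the input node name "input_1", the image size and the preprocessing are assumptions and must match the frozen ICAIPose graph:
# input_fn.py -- calibration input function for vai_q_tensorflow (a sketch)
import os
import cv2
import numpy as np

CALIB_DIR = "validation_data"
IMAGES = sorted(os.listdir(CALIB_DIR))

def calib_input(iter):
    # Load one calibration image per iteration and return it as a batch of size 1
    img = cv2.imread(os.path.join(CALIB_DIR, IMAGES[iter % len(IMAGES)]))
    img = cv2.resize(img, (256, 256)).astype(np.float32) / 255.0  # assumed preprocessing
    return {"input_1": np.expand_dims(img, axis=0)}  # key = name of the graph's input node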
The code to run ICAIPose is adapted from an example by Mario Bergeron on GitHub. It uses the normal VART API for Python. The script run_ICAIPose_FPGA.py is used to run the model on the FPGA in a multi-threaded environment.
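The core of such a VART-based script follows the usual deserialize/run pattern shown in the simplified sketch below; this is not the full multi-threaded script, and the data types and pre-/post-processing are placeholders:
# Single-frame inference with the VART Python API (simplified sketch)
import numpy as np
import xir
import vart

graph = xir.Graph.deserialize("own_network_256_KV260.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_subgraph, "run")

input_tensor = runner.get_input_tensors()[0]
output_tensor = runner.get_output_tensors()[0]
input_data = np.zeros(tuple(input_tensor.dims), dtype=np.float32)    # preprocessed frame goes here
output_data = np.zeros(tuple(output_tensor.dims), dtype=np.float32)  # confidence maps end up here

job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)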
The following library must be installed to use the script:
sudo pip3 install imutils
Go to the folder prebuilt/ICAIPose and run the script.
python3 run_ICAIPose_FPGA.py <number of threads> <model_file>
For example:
python3 run_ICAIPose_FPGA.py 3 own_network_256_KV260.xmodel
Close the window by pressing the letter 'q' on the host PC. The throughput performance is printed to the terminal.
Or with an HDMI display:
python3 run_ICAIPose_FPGA_hdmi.py <number of threads> <model_file> <Display Width> <Display Height>
For example:
python3 run_ICAIPose_FPGA_hdmi.py 3 own_network_256_KV260.xmodel 1920 1200
It is recommended to run the application over HDMI; depending on the network connection, the throughput of the X11-forwarded video stream may appear worse.
Results
It is interesting to know how fast the network runs on the FPGA, but it is also important to know whether HPE performance is lost due to the quantization.
Throughput Performance
ICAIPose (256x256, 103 GOps): 8 fps
With the B3136 DPU and a clock frequency of 300 MHz, a theoretical throughput of 940 GOps/s is given.
Therefore, the results are in the expected range: 940 GOps/s divided by the 103 GOps required for one image gives an upper bound of roughly 9 fps.
For comparison, the NVIDIA Jetson Xavier NX, which is more expensive than a KV260 and has a significantly higher theoretical throughput (21 TOps), reached the same throughput of 8 fps.
Human Pose Estimation Performance
This dataset, which provides more than 2000 images and the corresponding ideal confidence maps, is used to test the HPE performance.
The Mean Squared Error (MSE) of the normalized confidence maps is computed by squaring the difference at every pixel and averaging over the image. An example is shown in the following image. The mean of the image on the left is the MSE of the given input.
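A sketch of this metric, assuming the predicted and ideal confidence maps are available as NumPy arrays of the same shape and already normalized:
import numpy as np

def confidence_map_mse(predicted, ideal):
    # Mean squared error between predicted and ideal normalized confidence maps
    diff = predicted.astype(np.float32) - ideal.astype(np.float32)
    return np.mean(diff ** 2)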
We can now compare the MSE between the quantized and the float network. As additional information, the MSE of the original network with the PReLU activation function is shown.
The MSE of all images:
- Float: 0.8109
 - Quantized INT8: 0.9332
 - PReLU: 0.9348
The change from the PReLU to the Leaky ReLU activation function even improved the network performance a bit. The quantization has an impact on the MSE, but a rather small one. The quantized network performed as well as the unquantized PReLU network.
The Vitis AI framework from AMD-Xilinx was extensively tested and showed its strengths as well as some teething troubles. Changing the target device from GPU to FPGA was possible without losing significant performance, even though the FPGA board is cheaper. Vitis AI allows designing an efficient deep neural network for an FPGA without knowledge of HDL or HLS.
The Kria KV260 Vision AI Starter Kit is a great choice to start with Vitis AI. The provided camera can easily be used within the PetaLinux environment.
Acknowledgment
Special thanks to the ICAI and VRM Switzerland for providing a trained version of ICAIPose.
Thanks to the Institute of Microelectronics and Embedded Systems for supporting this challenge as part of a student project.
Revision History
- 3/14/2022 - Initial release
 








