This project is part of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
If you have not already read part 1 of this series, I urge you to start here:
In this project, I start by giving a recap of the challenges that can be expected when deploying the MediaPipe models.
Then I will address these challenges one by one, before deploying the models with the AzurEngine RPP R8 AI Accelerator.
Finally, I will perform profiling to determine if our goal of acceleration was achieved.
AzurEngine SDK Overview
AzurEngine's AI acceleration is based on their Circular Reconfigurable Parallel Processor (RPP).
The advantage of the RPP architecture, compared to GPUs, is higher power efficiency, lower latency, and smaller area, while preserving high performance and ease of deployment.
Their first generation silicon is the R8 family, more specifically the AE7100 device:
- 1024 processing cores
- 32 TOPS (INT8/INT32) / 16 TFLOPS (BF16/FP32)
- on-chip memory : 24MB SRAM
- external memory : 16GB LPDDR4 @ 59.7 GB/s
- PCIe Gen 3.0, 4 lanes
- H.265/H.264 Decoder (32 channels @ 1080P30)
- MJPEG/JPEG Codec (210 Mpixel/s for YUV 4:2:2)
- 15W typical power
One number that may stand out when comparing to other AI accelerators is the typical power consumption. Note that this figure includes the LPDDR4 power consumption. The device also has a low-power mode, which can scale the power consumption down further.
This external DDR offers the ability to run large language models (LLMs), in addition to traditional convolutional neural networks (CNNs).
In addition to the RPP R8 device, AzurEngine offers a scalable range of PCIe Gen 3.0 compatible AI accelerators.
This project will only cover the following M.2 AI acceleration modules:
- K08-1 : M.2 M Key (PCIe Gen 3.0, 4 lanes), 32 TOPS
The AzurEngine SDK supports the following frameworks:
- ONNX
- PyTorch
Other frameworks are indirectly supported by exporting or converting to the ONNX format.
The deployment involves the following tasks:
- Model Conversion and Optimization
- Model Execution
The AzurEngine RPP run-time expects the models to be in ONNX format. This is accomplished by either exporting (from PyTorch) or converting (for TFLite) the model to the ONNX format. In addition, the ONNX model is simplified for more efficient inference.
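As a rough illustration of these two steps, the conversion and simplification can also be scripted in Python. The file names below are placeholders; the exact command-line invocations used in this project are shown later in this article.

```python
# Sketch: TFLite -> ONNX conversion followed by graph simplification.
# File names are placeholders; the actual command-line invocations used in
# this project are shown in the conversion excerpts later in this article.
import onnx
from onnxsim import simplify
from tf2onnx import convert

TFLITE_MODEL = "palm_detection_full.tflite"    # input TFLite model
ONNX_MODEL   = "palm_detection_full.onnx"      # converted ONNX model
SIM_MODEL    = "palm_detection_full_sim.onnx"  # simplified ONNX model

# Step 1 : convert the TFLite model to ONNX (opset 12, as used in this project)
model_proto, _ = convert.from_tflite(TFLITE_MODEL, opset=12, output_path=ONNX_MODEL)

# Step 2 : simplify the ONNX graph for more efficient inference
model_simp, check = simplify(onnx.load(ONNX_MODEL))
assert check, "simplified ONNX model could not be validated"
onnx.save(model_simp, SIM_MODEL)
```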
In this project, we will be using version v1.6.11.7 of the AzurEngine SDK:
- AzurEngine SDK : v1.6.11.7
The first challenge that I encountered, in part 1, was the reality that the performance of the MediaPipe models significantly degrades when run on embedded platforms, compared to modern computers. This is the reason I am attempting to accelerate the models with the AzurEngine RPP-R8 accelerator.
The second challenge is the fact that Google does not provide the dataset that was used to train the MediaPipe models. Since quantization usually requires a subset of this training data, this presents us with the challenge of coming up with this data ourselves. With the AzurEngine flow, however, this is not required!
Creating a Calibration Dataset for Quantization
Contrary to most other AI accelerator flows, we do not need a calibration data set to deploy a model to AzurEngine's RPP-R8 accelerator.
This leads to several additional advantages:
- There is no need for calibration data, since quantization is not required
- There is no need for a GPU, since quantization is not required
As we saw previously in the "AzurEngine SDK Overview" section, the deployment phase starts with a conversion to ONNX format.
The first phase is to convert the TFLite models to ONNX, using the tf2onnx utility.
The following excerpt demonstrates the conversion, as well as model optimizations that are performed for the "v0.10 full" version of the palm detection and hand landmark models.
python3 -m tf2onnx.convert --opset 12 --tflite ../../blaze_tflite/models/palm_detection_full.tflite --output palm_detection_full.onnx
2025-11-17 20:57:10.048307: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:10.082854: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:10.083304: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-17 20:57:10.568500: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/azurengine/miniconda3/envs/az_samples_env_tria_mediapipe/lib/python3.8/runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2025-11-17 20:57:11.563730: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-11-17 20:57:11,565 - INFO - Using tensorflow=2.13.1, onnx=1.16.2, tf2onnx=1.16.1/15c810
2025-11-17 20:57:11,565 - INFO - Using opset <onnx, 12>
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
2025-11-17 20:57:11,877 - INFO - Optimizing ONNX model
2025-11-17 20:57:12,536 - INFO - After optimization: Cast -137 (137->0), Const -35 (175->140), Identity -2 (2->0), Reshape -28 (32->4), Transpose -259 (264->5)
2025-11-17 20:57:12,555 - INFO -
2025-11-17 20:57:12,555 - INFO - Successfully converted TensorFlow model ../../blaze_tflite/models/palm_detection_full.tflite to ONNX
2025-11-17 20:57:12,556 - INFO - Model inputs: ['input_1']
2025-11-17 20:57:12,556 - INFO - Model outputs: ['Identity', 'Identity_1']
2025-11-17 20:57:12,556 - INFO - ONNX model is saved at palm_detection_full.onnx
python3 -m tf2onnx.convert --opset 12 --tflite ../../blaze_tflite/models/hand_landmark_full.tflite --output hand_landmark_full.onnx
2025-11-17 20:57:16.244487: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:16.279447: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:16.279853: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-17 20:57:16.765567: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/azurengine/miniconda3/envs/az_samples_env_tria_mediapipe/lib/python3.8/runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2025-11-17 20:57:17.765770: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-11-17 20:57:17,767 - INFO - Using tensorflow=2.13.1, onnx=1.16.2, tf2onnx=1.16.1/15c810
2025-11-17 20:57:17,767 - INFO - Using opset <onnx, 12>
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
2025-11-17 20:57:18,064 - INFO - Optimizing ONNX model
2025-11-17 20:57:18,516 - INFO - After optimization: Cast -102 (102->0), Const -79 (183->104), GlobalAveragePool +1 (0->1), Identity -4 (4->0), ReduceMean -1 (1->0), Reshape -16 (16->0), Squeeze +1 (0->1), Transpose -191 (192->1)
2025-11-17 20:57:18,548 - INFO -
2025-11-17 20:57:18,548 - INFO - Successfully converted TensorFlow model ../../blaze_tflite/models/hand_landmark_full.tflite to ONNX
2025-11-17 20:57:18,548 - INFO - Model inputs: ['input_1']
2025-11-17 20:57:18,548 - INFO - Model outputs: ['Identity', 'Identity_1', 'Identity_2', 'Identity_3']
2025-11-17 20:57:18,548 - INFO - ONNX model is saved at hand_landmark_full.onnx
The second phase is to further optimize the model using the onnxsim utility.
The following excerpt demonstrates the additional optimizations that are performed for the "v0.10 full" version of the palm detection and hand landmark models.
onnxsim ../../blaze_onnx/models/palm_detection_full.onnx palm_detection_full_sim.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add │ 30 │ 30 │
│ Concat │ 2 │ 2 │
│ Conv │ 63 │ 63 │
│ MaxPool │ 4 │ 4 │
│ PRelu │ 31 │ 31 │
│ Pad │ 3 │ 3 │
│ Reshape │ 4 │ 4 │
│ Resize │ 2 │ 2 │
│ Transpose │ 5 │ 5 │
│ Model Size │ 4.4MiB │ 4.4MiB │
└────────────┴────────────────┴──────────────────┘
onnxsim ../../blaze_onnx/models/hand_landmark_full.onnx hand_landmark_full_sim.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add │ 9 │ 9 │
│ Clip │ 32 │ 32 │
│ Conv │ 47 │ 47 │
│ Gemm │ 4 │ 4 │
│ GlobalAveragePool │ 1 │ 1 │
│ Sigmoid │ 2 │ 2 │
│ Squeeze │ 1 │ 1 │
│ Transpose │ 1 │ 1 │
│ Model Size │ 10.4MiB │ 10.4MiB │
└───────────────────┴────────────────┴──────────────────┘
Model Execution
In order to support execution on RPP, I will be using the "blaze_app_python" application.
The "blaze_app_python" application is a sub-set of the mediapipe framework, allowing inference of individual models. This allows to run inference on various target hardware, and use a common code base for profiling.
This application was augmented with the following inference targets (a minimal sketch is shown after this list):
- blaze_onnx : in order to run inference with the ONNX runtime
- blaze_rpp : in order to run inference with the RPP runtime
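To give an idea of what such an inference target looks like, here is a minimal sketch of the ONNX runtime case. This is a simplification rather than the actual blaze_onnx code (which also handles pre-processing, anchor decoding, and profiling); the blaze_rpp target follows the same pattern, with AzurEngine's RPP runtime API in place of onnxruntime. The input shape shown is an assumption for illustration only.

```python
# Minimal sketch of an ONNX runtime inference target (not the actual blaze_onnx
# code, which also handles pre-processing, anchor decoding, and profiling).
import numpy as np
import onnxruntime as ort

class BlazeOnnxModel:
    def __init__(self, model_path):
        # Load the (simplified) ONNX model on the CPU execution provider
        self.session = ort.InferenceSession(
            model_path, providers=["CPUExecutionProvider"])
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, img):
        # img : pre-processed input tensor (assumed shape for the palm
        #       detection model : (1, 192, 192, 3), float32)
        return self.session.run(None, {self.input_name: img.astype(np.float32)})

# Example usage (input shape is an assumption for illustration only)
model = BlazeOnnxModel("blaze_onnx/models/palm_detection_full.onnx")
outputs = model.predict(np.zeros((1, 192, 192, 3), dtype=np.float32))
```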
My final inference code for RPP inference can be found in the "blaze_app_python" repository, under the blaze_rpp sub-directory:
We will also use the "hand_controller" application, which builds on top of "blaze_app_python", to implement a sign language recognition application.
Installing the AzurEngine SDK
AzurEngine provides documentation for installing their SDK, their RPP Linux kernel driver, and their RPP runtime.
Once installed, we launch the conda environment that was created during installation of the SDK:
conda activate az_samples_env
Installing the python application
The python demo application requires certain packages which can be installed as follows:
pip3 install tensorflow
pip3 install onnx onnxruntime
pip3 install tf2onnx onnxsim
The python application can be accessed from the following GitHub repository:
git clone --recursive https://github.com/AlbertaBeef/hand_controller
cd hand_controller/blaze_app_python
In order to successfully use the python demo with the original TFLite models, they need to be downloaded from the Google web site:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
Convert the TFLite models to ONNX, as follows:
cd blaze_onnx/models
source ./convert_models.sh
cd ../..
Convert the ONNX models for use with RPP, as follows:
cd blaze_rpp/models
source ./convert_models.sh
cd ../..
You are all set!
Launching the python application
The python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_tflite |... | blaze_onnx | blaze_rpp
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the python script as follows:
$ python3 blaze_detect_live.py --list
[INFO] user@hosthame : azurengine@azurengine
[INFO] blaze_tflite supported ...
...
[INFO] blaze_onnx supported ...
[INFO] blaze_rpp supported ...
Command line options:
--input :
--testimage : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_tflite_quant,blaze_pytorch,blaze_vitisai,blaze_hailo,blaze_onnx,blaze_rpp
--pipeline : all
--list : True
--verbose : False
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
...
11 rpp_hand_v0_07 blaze_rpp/models/palm_detection_v0_07_sim.onnx
blaze_rpp/models/hand_landmark_v0_07_sim.onnx
12 rpp_hand_v0_10_lite blaze_rpp/models/palm_detection_lite_sim.onnx
blaze_rpp/models/hand_landmark_lite_sim.onnx
13 rpp_hand_v0_10_full blaze_rpp/models/palm_detection_full_sim.onnx
blaze_rpp/models/hand_landmark_full_sim.onnx
...
27 rpp_face_v0_07_front blaze_rpp/models/face_detection_front_v0_07_sim.onnx
blaze_rpp/models/face_landmark_v0_07_sim.onnx
28 rpp_face_v0_07_back blaze_rpp/models/face_detection_back_v0_07_sim.onnx
blaze_rpp/models/face_landmark_v0_07_sim.onnx
29 rpp_face_v0_10_short blaze_rpp/models/face_detection_short_range_sim.onnx
blaze_rpp/models/face_landmark_sim.onnx
30 rpp_face_v0_10_full blaze_rpp/models/face_detection_full_range_sim.onnx
blaze_rpp/models/face_landmark_sim.onnx
...
43 rpp_pose_v0_10_lite blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_lite_sim.onnx
44 rpp_pose_v0_10_full blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_full_sim.onnx
45 rpp_pose_v0_10_heavy blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_heavy_sim.onnx
In order to launch the RPP pipeline for palm detection and hand landmarks, with the monitor, use the python script as follows:
python3 blaze_detect_live.py --pipeline=rpp_hand_v0_10_lite
This will launch the v0.10 (lite) version of the models, as shown below:
The previous video has not been sped up. It shows a frame rate of 30 fps when no hands are detected (one model running: palm detection), when one hand has been detected (two models running: palm detection and hand landmarks), and when two hands have been detected (three models running: palm detection and two hand landmarks).
In order to know the true performance of the models running on the RPP accelerator, we will need to detach from the USB camera (which caps the frame rate at 30 fps). We will be doing this in the next section.
Benchmarking the models
The profiling functionality uses a test image that can be downloaded from Google as follows:
source ./get_test_images.sh
The following commands can be used to generate profile results for the rpp_hand_v0_10_lite pipeline with the RPP runtime, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=rpp_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_ryzen5600gt_rpp_hand_v0_10_lite.csv
The following commands can be used to generate profile results for the tfl_hand_v0_10_lite pipeline using the TFLite models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_ryzen5600gt_tfl_hand_v0_10_lite.csv
The same is done for the other models of each MediaPipe pipeline.
The results of all .csv files were averaged, then plotted using Excel.
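As an alternative to Excel, the averaging step can also be scripted. The sketch below assumes that the numeric columns hold per-stage latencies; the column contents would need to be adjusted to the actual blaze_detect_live.csv format produced by the --profilelog option.

```python
# Sketch: average the per-run latencies from the profiling .csv logs.
# The numeric columns are assumed to hold per-stage latencies; adjust to the
# actual blaze_detect_live.csv format produced by the --profilelog option.
import glob
import pandas as pd

summary = {}
for csv_file in sorted(glob.glob("blaze_detect_live_*.csv")):
    df = pd.read_csv(csv_file)
    # Average every numeric column (e.g. per-stage execution times)
    summary[csv_file] = df.mean(numeric_only=True)

# One row per benchmark run, one column per profiled stage
results = pd.DataFrame(summary).T
print(results)
results.to_csv("profiling_summary.csv")
```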
Here are the profiling results for the 0.07 and 0.10 versions of the models deployed with RPP, in comparison to the reference TFLite models:
It is worth noting that these benchmarks have been taken with a single-threaded python script. There is additional opportunity for acceleration with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
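As a rough illustration of the multi-threading idea, the sketch below overlaps the landmark inference for two detected hands with a thread pool. It uses onnxruntime sessions as a stand-in (the RPP runtime API is not shown in this article), and the model path and input shape are assumptions for illustration only.

```python
# Sketch: overlap landmark inference for multiple detected hands with a thread
# pool. onnxruntime is used as a stand-in; the same pattern would apply to the
# RPP runtime. The model path and input shape are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import onnxruntime as ort

landmark_session = ort.InferenceSession(
    "blaze_onnx/models/hand_landmark_full.onnx",
    providers=["CPUExecutionProvider"])
input_name = landmark_session.get_inputs()[0].name

def run_landmarks(roi):
    # roi : cropped, pre-processed hand region (assumed (1, 224, 224, 3) float32)
    return landmark_session.run(None, {input_name: roi})

# Hypothetical list of hand ROIs produced by the palm detection stage
hand_rois = [np.zeros((1, 224, 224, 3), dtype=np.float32) for _ in range(2)]

# Dispatch both landmark inferences concurrently instead of sequentially
with ThreadPoolExecutor(max_workers=2) as pool:
    landmark_results = list(pool.map(run_landmarks, hand_rois))
```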
Competitive Comparison
This section compares the performance of the AzurEngine K08-1 to similar M.2 modules from other vendors, specifically using my blaze_app_python application:
- AzurEngine : K08-1 (M.2 M-Key, 4 lanes)
- Hailo : Hailo-8 (M.2 M-Key, 4 lanes)
- MemryX : MX3 (M.2 M-Key, 2 lanes)
Before comparing the acceleration achieved for each of these M.2 AI accelerators, we need to define a common baseline.
Although my Hailo-8 and MemryX benchmarks were taken on an HP Z4 G4 workstation (with Intel Xeon), and the AzurEngine benchmarks were taken on a TUF Gaming workstation (with Ryzen 5 5600GT), the reference TFLite models have almost identical execution times.
We can compare the Latency Benchmarks for our three M.2 AI accelerator cards:
Remembering that these are latency benchmarks (execution times), and that shorter is better, we can quickly observe that AzurEngine's RPP-R8 accelerator achieves from 3X up to 6X more acceleration than the other two accelerators for individual models.
The current version of this project has models for the following pipelines:
- palm detection : working
- hand landmarks : working
- face detection : working
- face landmarks : working
- pose detection : not supported (pose detection model does not convert)
- pose landmarks : working
The following platforms have been tested:
- TUF Gaming PC (Ryzen 5 5600GT), with K08-1 (M Key, 4 lanes)
I want to mention some issues I ran into, only to emphasize the fact that these were quickly resolved by the AzurEngine team.
Unsupported option in PAD layer
As with many of the other vendors I explored, the padding in the channel dimension was initially not supported.
In contrast with the other vendors, however, AzurEngine quickly added support for this option.
This was resolved as an update in SDK v1.6.11.1.
For anyone having dealt with tool issues, the quick turn-around is a testimony to AzurEngine's customer support and engineering efficiency!
Latency during data transfers
Although not an issue, per se, this showed up with the Python API, which was using a normal malloc instead of a dedicated allocation for the AzurEngine runtime.
One solution would have been to abandon the python implementation for a C++ implementation. This, however, was not required, as AzurEngine's engineering team quickly updated the python API.
This was resolved as an update in SDK v1.6.11.7.
Another testimony to AzurEngine's customer support and engineering efficiency !
When I thanked AzurEngine for their exceptional support, they responded : "Fixing these issues was actually very easy. Thanks to our configuration architecture, we can iterate very quickly on new features."
Considering that today, state-of-the-art (SOTA) means new models and architectures on a weekly basis, this is very comforting to know.
Conclusion
I only scratched the surface of AzurEngine's functionality. There are so many other features to explore and discover, of which the most interesting are:
- LLM execution
- CUDA compatibility
For which embedded platforms would you like to see the AzurEngine AI acceleration module supported?
Let me know in the comments...
Acknowledgements
I want to thank AzurEngine for the opportunity to evaluate their RPP-R8 accelerator, and the phenomenal support provided during this exploration.
Version History
- 2025/12/29 - Initial Version
- 2026/01/10 - Added Competitive Comparison
- 2025/10/20 - Published
Hand Controller (by Mario Bergeron):
- Hand Controller (python version) : hand_controller
Accelerating MediaPipe (by Mario Bergeron):
- Hackster Series Part 1 : Blazing Fast Models
- Hackster Series Part 2 : Insightful Datasets for ASL recognition
- Hackster Series Part 3 : Accelerating the MediaPipe models with Vitis-AI 3.5
- Hackster Series Part 4 : Accelerating the MediaPipe models with Hailo-8
- Hackster Series Part 5 : Accelerating the MediaPipe models on RPI5 AI Kit
- Hackster Series Part 6 : Accelerating the MediaPipe models with MemryX
- Hackster Series Part 7 : Accelerating the MediaPipe models with Qualcomm
- Hackster Series Part 8 : Accelerating the MediaPipe models with AzurEngine
- Blaze Utility (python version) : blaze_app_python
- Blaze Utility (C++ version) : blaze_app_cpp
ASL Recognition using PointNet (by Edward Roe):
- Medium Article : ASL Recognition using PointNet and MediaPipe
- Kaggle Dataset : American Sign Language Dataset
- GitHub Source : pointnet_hands

