This project is part of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
If you have not already read part 1 of this series, I urge you to start here:
In this project, I start by giving a recap of the challenges that can be expected when deploying the MediaPipe models.
Then I will address these challenges one by one, before deploying the models with the AzurEngine RPP R8 AI Accelerator.
Finally, I will perform profiling to determine if our goal of acceleration was achieved.
AzurEngine SDK Overview
AzurEngine's AI acceleration is based on their Circular Reconfigurable Parallel Processor (RPP).
The advantage of the RPP architecture, compared to GPUs, is higher power efficiency, lower latency, and smaller area, while preserving high performance and ease of deployment.
Their first generation silicon is the R8 family, more specifically the AE7100 device:
- 1024 processing cores
- 32 TOPS (INT8/INT32) / 16 TFLOPS (BF16/FP32)
- on-chip memory : 24MB SRAM
- external memory : 16GB LPDDR4 @ 59.7 GB/s
- PCIe Gen 3.0, 4 lanes
- H.265/H.264 Decoder (32 channels @ 1080P30)
- MJPEG/JPEG Codec (210 Mpixel/s for YUV 4:2:2)
- 15W typical power
One number that may stand out when comparing to other AI accelerators is the typical power consumption. Note that this figure includes the LPDDR4 power consumption. The device also has a low-power mode, which can scale the power consumption down further.
This external DDR offers the ability to run large language models (LLMs), in addition to traditional convolutional neural networks (CNNs).
In addition to the RPP R8 device, AzurEngine offers a scalable range of PCIe Gen 3.0 compatible AI accelerators.
This project will only cover the following M.2 AI acceleration modules:
- K08-1 : M.2 M Key (PCIe Gen 3.0, 4 lanes), 32 TOPS
The AzurEngine SDK supports the following frameworks:
- ONNX
- PyTorch
Other frameworks are indirectly supported by exporting or converting to the ONNX format.
The deployment involves the following tasks:
- Model Conversion and Optimization
- Model Execution
The AzurEngine RPP run-time expects the models to be in ONNX format. This is accomplished by either exporting (from PyTorch) or converting (for TFLite) the model to the ONNX format. In addition, the ONNX model is simplified for more efficient inference.
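As a rough illustration of these two steps, the conversion and simplification can also be scripted in Python. The file names below are placeholders; the exact command-line invocations used in this project are shown later in this article.

```python
# Sketch: TFLite -> ONNX conversion followed by graph simplification.
# File names are placeholders; the actual command-line invocations used in
# this project are shown in the conversion excerpts later in this article.
import onnx
from onnxsim import simplify
from tf2onnx import convert

TFLITE_MODEL = "palm_detection_full.tflite"    # input TFLite model
ONNX_MODEL   = "palm_detection_full.onnx"      # converted ONNX model
SIM_MODEL    = "palm_detection_full_sim.onnx"  # simplified ONNX model

# Step 1 : convert the TFLite model to ONNX (opset 12, as used in this project)
model_proto, _ = convert.from_tflite(TFLITE_MODEL, opset=12, output_path=ONNX_MODEL)

# Step 2 : simplify the ONNX graph for more efficient inference
model_simp, check = simplify(onnx.load(ONNX_MODEL))
assert check, "simplified ONNX model could not be validated"
onnx.save(model_simp, SIM_MODEL)
```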
In this project, we will be using version v1.6.11.7 of the AzurEngine SDK:
- AzurEngine SDK : v1.6.11.7
The first challenge that I encountered, in part 1, was the reality that the performance of the MediaPipe models significantly degrades when run on embedded platforms, compared to modern computers. This is the reason I am attempting to accelerate the models with the AzurEngine RPP-R8 accelerator.
The second challenge is the fact that Google does not provide the dataset that was used to train the MediaPipe models. Since quantization usually requires a subset of this training data, this presents us with the challenge of coming up with this data ourselves. With the AzurEngine flow, however, this is not required!
Creating a Calibration Dataset for Quantization
Contrary to most other AI accelerator flows, we do not need a calibration data set to deploy a model to AzurEngine's RPP-R8 accelerator.
This leads to several additional advantages:
- There is no need for calibration data, since quantization is not required
- There is no need for a GPU, since quantization is not required
As we saw previously in the "AzurEngine SDK Overview" section, the deployment phase starts with a conversion to ONNX format.
The first phase is to convert the TFLite models to ONNX, using the tf2onnx utility.
The following excerpt demonstrates the conversion, as well as model optimizations that are performed for the "v0.10 full" version of the palm detection and hand landmark models.
python3 -m tf2onnx.convert --opset 12 --tflite ../../blaze_tflite/models/palm_detection_full.tflite --output palm_detection_full.onnx
2025-11-17 20:57:10.048307: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:10.082854: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:10.083304: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-17 20:57:10.568500: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/azurengine/miniconda3/envs/az_samples_env_tria_mediapipe/lib/python3.8/runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2025-11-17 20:57:11.563730: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-11-17 20:57:11,565 - INFO - Using tensorflow=2.13.1, onnx=1.16.2, tf2onnx=1.16.1/15c810
2025-11-17 20:57:11,565 - INFO - Using opset <onnx, 12>
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
2025-11-17 20:57:11,877 - INFO - Optimizing ONNX model
2025-11-17 20:57:12,536 - INFO - After optimization: Cast -137 (137->0), Const -35 (175->140), Identity -2 (2->0), Reshape -28 (32->4), Transpose -259 (264->5)
2025-11-17 20:57:12,555 - INFO -
2025-11-17 20:57:12,555 - INFO - Successfully converted TensorFlow model ../../blaze_tflite/models/palm_detection_full.tflite to ONNX
2025-11-17 20:57:12,556 - INFO - Model inputs: ['input_1']
2025-11-17 20:57:12,556 - INFO - Model outputs: ['Identity', 'Identity_1']
2025-11-17 20:57:12,556 - INFO - ONNX model is saved at palm_detection_full.onnx
python3 -m tf2onnx.convert --opset 12 --tflite ../../blaze_tflite/models/hand_landmark_full.tflite --output hand_landmark_full.onnx
2025-11-17 20:57:16.244487: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:16.279447: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-17 20:57:16.279853: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-17 20:57:16.765567: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/azurengine/miniconda3/envs/az_samples_env_tria_mediapipe/lib/python3.8/runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2025-11-17 20:57:17.765770: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-11-17 20:57:17,767 - INFO - Using tensorflow=2.13.1, onnx=1.16.2, tf2onnx=1.16.1/15c810
2025-11-17 20:57:17,767 - INFO - Using opset <onnx, 12>
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
2025-11-17 20:57:18,064 - INFO - Optimizing ONNX model
2025-11-17 20:57:18,516 - INFO - After optimization: Cast -102 (102->0), Const -79 (183->104), GlobalAveragePool +1 (0->1), Identity -4 (4->0), ReduceMean -1 (1->0), Reshape -16 (16->0), Squeeze +1 (0->1), Transpose -191 (192->1)
2025-11-17 20:57:18,548 - INFO -
2025-11-17 20:57:18,548 - INFO - Successfully converted TensorFlow model ../../blaze_tflite/models/hand_landmark_full.tflite to ONNX
2025-11-17 20:57:18,548 - INFO - Model inputs: ['input_1']
2025-11-17 20:57:18,548 - INFO - Model outputs: ['Identity', 'Identity_1', 'Identity_2', 'Identity_3']
2025-11-17 20:57:18,548 - INFO - ONNX model is saved at hand_landmark_full.onnx
The second phase is to further optimize the model using the onnxsim utility.
The following excerpt demonstrates the additional optimizations that are performed for the "v0.10 full" version of the palm detection and hand landmark models.
onnxsim ../../blaze_onnx/models/palm_detection_full.onnx palm_detection_full_sim.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add │ 30 │ 30 │
│ Concat │ 2 │ 2 │
│ Conv │ 63 │ 63 │
│ MaxPool │ 4 │ 4 │
│ PRelu │ 31 │ 31 │
│ Pad │ 3 │ 3 │
│ Reshape │ 4 │ 4 │
│ Resize │ 2 │ 2 │
│ Transpose │ 5 │ 5 │
│ Model Size │ 4.4MiB │ 4.4MiB │
└────────────┴────────────────┴──────────────────┘
onnxsim ../../blaze_onnx/models/hand_landmark_full.onnx hand_landmark_full_sim.onnx
Simplifying...
Finish! Here is the difference:
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add │ 9 │ 9 │
│ Clip │ 32 │ 32 │
│ Conv │ 47 │ 47 │
│ Gemm │ 4 │ 4 │
│ GlobalAveragePool │ 1 │ 1 │
│ Sigmoid │ 2 │ 2 │
│ Squeeze │ 1 │ 1 │
│ Transpose │ 1 │ 1 │
│ Model Size │ 10.4MiB │ 10.4MiB │
└───────────────────┴────────────────┴──────────────────┘
Model Execution
In order to support execution on RPP, I will be using the "blaze_app_python" application.
The "blaze_app_python" application is a sub-set of the mediapipe framework, allowing inference of individual models. This allows to run inference on various target hardware, and use a common code base for profiling.
This application was augmented with the following inference targets (a minimal sketch is shown after this list):
- blaze_onnx : in order to run inference with the ONNX runtime
- blaze_rpp : in order to run inference with the RPP runtime
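To give an idea of what such an inference target looks like, here is a minimal sketch of the ONNX runtime case. This is a simplification rather than the actual blaze_onnx code (which also handles pre-processing, anchor decoding, and profiling); the blaze_rpp target follows the same pattern, with AzurEngine's RPP runtime API in place of onnxruntime. The input shape shown is an assumption for illustration only.

```python
# Minimal sketch of an ONNX runtime inference target (not the actual blaze_onnx
# code, which also handles pre-processing, anchor decoding, and profiling).
import numpy as np
import onnxruntime as ort

class BlazeOnnxModel:
    def __init__(self, model_path):
        # Load the (simplified) ONNX model on the CPU execution provider
        self.session = ort.InferenceSession(
            model_path, providers=["CPUExecutionProvider"])
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, img):
        # img : pre-processed input tensor (assumed shape for the palm
        #       detection model : (1, 192, 192, 3), float32)
        return self.session.run(None, {self.input_name: img.astype(np.float32)})

# Example usage (input shape is an assumption for illustration only)
model = BlazeOnnxModel("blaze_onnx/models/palm_detection_full.onnx")
outputs = model.predict(np.zeros((1, 192, 192, 3), dtype=np.float32))
```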
My final inference code for RPP inference can be found in the "blaze_app_python" repository, under the blaze_rpp sub-directory:
We will also use the "hand_controller" application, which builds on top of "blaze_app_python", to implement a sign language recognition application.
Installing the AzurEngine SDK
AzurEngine provides documentation for installing their SDK, their RPP Linux kernel driver, and their RPP runtime.
Once installed, we launch the conda environment that was created during installation of the SDK:
conda activate az_samples_env
Installing the python application
The python demo application requires certain packages which can be installed as follows:
pip3 install tensorflow
pip3 install onnx onnxruntime
pip3 install tf2onnx onnxsim
The python application can be accessed from the following GitHub repository:
git clone --recursive https://github.com/AlbertaBeef/hand_controller
cd hand_controller/blaze_app_python
In order to successfully use the python demo with the original TFLite models, they need to be downloaded from the Google web site:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
Convert the TFLite models to ONNX, as follows:
cd blaze_onnx/models
source ./convert_models.sh
cd ../..
Convert the ONNX models for use with RPP, as follows:
cd blaze_rpp/models
source ./convert_models.sh
cd ../..
You are all set!
Launching the python application
The python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_tflite |... | blaze_onnx | blaze_rpp
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the python script as follows:
$ python3 blaze_detect_live.py --list
[INFO] user@hosthame : azurengine@azurengine
[INFO] blaze_tflite supported ...
...
[INFO] blaze_onnx supported ...
[INFO] blaze_rpp supported ...
Command line options:
--input :
--testimage : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_tflite_quant,blaze_pytorch,blaze_vitisai,blaze_hailo,blaze_onnx,blaze_rpp
--pipeline : all
--list : True
--verbose : False
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
...
11 rpp_hand_v0_07 blaze_rpp/models/palm_detection_v0_07_sim.onnx
blaze_rpp/models/hand_landmark_v0_07_sim.onnx
12 rpp_hand_v0_10_lite blaze_rpp/models/palm_detection_lite_sim.onnx
blaze_rpp/models/hand_landmark_lite_sim.onnx
13 rpp_hand_v0_10_full blaze_rpp/models/palm_detection_full_sim.onnx
blaze_rpp/models/hand_landmark_full_sim.onnx
...
27 rpp_face_v0_07_front blaze_rpp/models/face_detection_front_v0_07_sim.onnx
blaze_rpp/models/face_landmark_v0_07_sim.onnx
28 rpp_face_v0_07_back blaze_rpp/models/face_detection_back_v0_07_sim.onnx
blaze_rpp/models/face_landmark_v0_07_sim.onnx
29 rpp_face_v0_10_short blaze_rpp/models/face_detection_short_range_sim.onnx
blaze_rpp/models/face_landmark_sim.onnx
30 rpp_face_v0_10_full blaze_rpp/models/face_detection_full_range_sim.onnx
blaze_rpp/models/face_landmark_sim.onnx
...
43 rpp_pose_v0_10_lite blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_lite_sim.onnx
44 rpp_pose_v0_10_full blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_full_sim.onnx
45 rpp_pose_v0_10_heavy blaze_tflite/models/pose_detection.tflite
blaze_rpp/models/pose_landmark_heavy_sim.onnx
In order to launch the RPP pipeline for palm detection and hand landmarks, with the monitor, use the python script as follows:
python3 blaze_detect_live.py --pipeline=rpp_hand_v0_10_lite
This will launch the v0.10 (lite) version of the models, as shown below:
The previous video has not been sped up. It shows a frame rate of 30 fps when no hands are detected (one model running: palm detection), when one hand has been detected (two models running: palm detection and hand landmarks), and when two hands have been detected (three models running: palm detection and two hand landmarks).
In order to know the true performance of the models running on the RPP accelerator, we will need to detach from the USB camera (which caps the frame rate at 30 fps). We will be doing this in the next section.
Benchmarking the models
The profiling functionality uses a test image that can be downloaded from Google as follows:
source ./get_test_images.sh
The following commands can be used to generate profile results for the rpp_hand_v0_10_lite pipeline with the RPP runtime, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=rpp_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_ryzen5600gt_rpp_hand_v0_10_lite.csv
The following commands can be used to generate profile results for the tfl_hand_v0_10_lite pipeline using the TFLite models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_ryzen5600gt_tfl_hand_v0_10_lite.csv
The same is done for the other models of each MediaPipe pipeline.
The results of all .csv files were averaged, then plotted using Excel.
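As an alternative to Excel, the averaging step can also be scripted. The sketch below assumes that the numeric columns hold per-stage latencies; the column contents would need to be adjusted to the actual blaze_detect_live.csv format produced by the --profilelog option.

```python
# Sketch: average the per-run latencies from the profiling .csv logs.
# The numeric columns are assumed to hold per-stage latencies; adjust to the
# actual blaze_detect_live.csv format produced by the --profilelog option.
import glob
import pandas as pd

summary = {}
for csv_file in sorted(glob.glob("blaze_detect_live_*.csv")):
    df = pd.read_csv(csv_file)
    # Average every numeric column (e.g. per-stage execution times)
    summary[csv_file] = df.mean(numeric_only=True)

# One row per benchmark run, one column per profiled stage
results = pd.DataFrame(summary).T
print(results)
results.to_csv("profiling_summary.csv")
```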
Here are the profiling results for the 0.07 and 0.10 versions of the models deployed with RPP, in comparison to the reference TFLite models:
It is worth noting that these benchmarks have been taken with a single-threaded python script. There is additional opportunity for acceleration with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
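As a rough illustration of the multi-threading idea, the sketch below overlaps the landmark inference for two detected hands with a thread pool. It uses onnxruntime sessions as a stand-in (the RPP runtime API is not shown in this article), and the model path and input shape are assumptions for illustration only.

```python
# Sketch: overlap landmark inference for multiple detected hands with a thread
# pool. onnxruntime is used as a stand-in; the same pattern would apply to the
# RPP runtime. The model path and input shape are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import onnxruntime as ort

landmark_session = ort.InferenceSession(
    "blaze_onnx/models/hand_landmark_full.onnx",
    providers=["CPUExecutionProvider"])
input_name = landmark_session.get_inputs()[0].name

def run_landmarks(roi):
    # roi : cropped, pre-processed hand region (assumed (1, 224, 224, 3) float32)
    return landmark_session.run(None, {input_name: roi})

# Hypothetical list of hand ROIs produced by the palm detection stage
hand_rois = [np.zeros((1, 224, 224, 3), dtype=np.float32) for _ in range(2)]

# Dispatch both landmark inferences concurrently instead of sequentially
with ThreadPoolExecutor(max_workers=2) as pool:
    landmark_results = list(pool.map(run_landmarks, hand_rois))
```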
Competitive Comparison
This section compares the performance of the AzurEngine K08-1 to similar M.2 modules from other vendors, specifically using my blaze_app_python application:
- AzurEngine : K08-1 (M.2 M-Key, 4 lanes)
- Hailo : Hailo-8 (M.2 M-Key, 4 lanes)
- MemryX : MX3 (M.2 M-Key, 2 lanes)
Before comparing the acceleration achieved for each of these M.2 AI accelerators, we need to define a common baseline.
Although my Hailo-8 and MemryX benchmarks were taken on an HP Z4 G4 workstation (with Intel Xeon), and the AzurEngine benchmarks were taken on a TUF Gaming workstation (with Ryzen 5 5600GT), the reference TFLite models have almost identical execution times.
We can compare the Latency Benchmarks for our three M.2 AI accelerator cards:
Remembering that these are latency benchmarks (execution times), and that shorter is better, we can quickly observe that AzurEngine's RPP-R8 accelerator achieves from 3X up to 6X more acceleration than the other two accelerators for individual models.
The current version of this project has models for the following pipelines:
- palm detection : working
- hand landmarks : working
- face detection : working
- face landmarks : working
- pose detection : not supported (pose detection model does not convert)
- pose landmarks : working
The following platforms have been tested:
- TUF Gaming PC (Ryzen 5 5600GT), with K08-1 (M Key, 4 lanes)
I want to mention some issues I ran into, only to emphasize the fact that these were quickly resolved by the AzurEngine team.
Unsupported option in PAD layer
As with many of the other vendors I explored, the padding in the channel dimension was initially not supported.
In contrast with the other vendors, however, AzurEngine quickly added support for this option.
This was resolved as an update in SDK v1.6.11.1.
For anyone having dealt with tool issues, the quick turn-around is a testimony to AzurEngine's customer support and engineering efficiency!
Latency during data transfers
Although not an issue, per se, this showed up with the Python API, which was using a normal malloc instead of a dedicated allocation for the AzurEngine runtime.
One solution would have been to abandon the python implementation for a C++ implementation. This, however, was not required, as AzurEngine's engineering team quickly updated the python API.
This was resolved as an update in SDK v1.6.11.7.
Another testimony to AzurEngine's customer support and engineering efficiency !
When I thanked AzurEngine for their exceptional support, they responded : "Fixing these issues was actually very easy. Thanks to our configuration architecture, we can iterate very quickly on new features."
Considering that today, state-of-the-art (SOTA) means new models and architectures on a weekly basis, this is very comforting to know.
Conclusion
I only scratched the surface of AzurEngine's functionality. There are so many other features to explore and discover, of which the most interesting are:
- LLM execution
- CUDA compatibility
For which embedded platforms would you like to see the AzurEngine AI acceleration module supported?
Let me know in the comments...
Acknowledgements
I want to thank AzurEngine for the opportunity to evaluate their RPP-R8 accelerator, and the phenomenal support provided during this exploration.
Version History
- 2025/12/29 - Initial Version
- 2026/01/10 - Added Competitive Comparison
- 2025/10/20 - Published
Hand Controller (by Mario Bergeron):
- Hand Controller (python version) : hand_controller
Accelerating MediaPipe (by Mario Bergeron):
- Hackster Series Part 1 : Blazing Fast Models
- Hackster Series Part 2 : Insightful Datasets for ASL recognition
- Hackster Series Part 3 : Accelerating the MediaPipe models with Vitis-AI 3.5
- Hackster Series Part 4 : Accelerating the MediaPipe models with Hailo-8
- Hackster Series Part 5 : Accelerating the MediaPipe models on RPI5 AI Kit
- Hackster Series Part 6 : Accelerating the MediaPipe models with MemryX
- Hackster Series Part 7 : Accelerating the MediaPipe models with Qualcomm
- Hackster Series Part 8 : Accelerating the MediaPipe models with AzurEngine
- Blaze Utility (python version) : blaze_app_python
- Blaze Utility (C++ version) : blaze_app_cpp
ASL Recognition using PointNet (by Edward Roe):
- Medium Article : ASL Recognition using PointNet and MediaPipe
- Kaggle Dataset : American Sign Language Dataset
- GitHub Source : pointnet_hands

