This is the first project of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
I will start by giving a general overview of the MediaPipe models, and some of the challenges that can be expected when running on embedded hardware.
Then I will dive into the architecture of its most popular models (detection and landmarks for hands, face, and body pose).
Finally, I will introduce a custom python application that I created as a tool for running, debugging, and profiling alternate implementations of these models.
Status of the MediaPipe models
MediaPipe is a framework created by Google that implements detection and landmark estimation for the following applications:
- palm detection and hand landmarks
- face detection and landmarks
- pose detection and landmarks
- etc...
The MediaPipe framework supports/requires the following tools:
- TF-Lite (models)
- Bazel (build tools)
The first public versions of the MediaPipe models appeared in 2019, which can be considered a very long time ago with respect to AI/ML's rapid evolution.
The latest version of the MediaPipe framework, as of this writing, is v0.10.xx, which is still considered a development version.
Since its first public release, Google has made continual improvements to its MediaPipe framework.
One of the most notable improvements, documented in 2021, was better hand landmark accuracy for individual fingers, which is especially important for sign language applications.
In the latest version of the framework (0.10.14), Google introduced a new MediaPipe LLM Inference API enabling select large language models (Gemma 2B, Falcon 1B, Phi 2, and Stable LM 3B).
It is interesting that Google has never published a v1.00 release, which would be considered stable... most probably because the models and framework are continually being improved.
Challenges with the MediaPipe models
If you are using the MediaPipe framework on a modern computer, all that is required is a "pip install mediapipe", and everything works out of the box!
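For example, the legacy "solutions" API can be used in just a handful of lines (the image file name below is illustrative):
# Minimal out-of-the-box example using the legacy "solutions" API.
# The image file name is illustrative.
import cv2
import mediapipe as mp

image = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)
with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    results = hands.process(image)
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Each hand has 21 landmarks with normalized x, y, z coordinates
            print(hand_landmarks.landmark[0])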
Re-building the MediaPipe framework for a custom target, however, can be a challenge if you are not familiar with the Bazel build tools.
Re-training the models is not possible, since the dataset(s) used to train them were never published by Google.
Deployment to specialized targets usually requires at least a subset of the training dataset (for calibration, for example), which presents the challenge of coming up with this data yourself.
Furthermore, many targets do not support the TF-Lite framework, so the models need to be converted to an alternate framework prior to deployment.
One example of this is OpenVINO, where the models are first converted to TensorFlow2 using PINTO's model conversion utilities, before being fed into the OpenVINO framework.
Another example I ran into was deploying to Vitis-AI, where I had to work with models that had been converted to PyTorch. But that is a subject for another project...
Still Relevant Today
In 2023, Google launched two open-source competitions on Kaggle, totaling $300,000 in prizes.
Isolated Sign Language Recognition
- https://www.kaggle.com/competitions/asl-signs/overview
- Feb-May 2023
- $100,000 Prize
Fingerspelling Recognition
- https://www.kaggle.com/competitions/asl-fingerspelling
- May-Aug 2023
- $200,000 Prize
It is very interesting to analyze the winning solution for the Fingerspelling Recognition competition:
The input to the solution is the following subset of 130 landmarks:
- 21 key points from each hand
- 6 pose key points from each arm
- 76 from the face (lips, nose, eyes)
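For reference, these counts do add up to the 130 input landmarks:
2 x 21 (hands) + 2 x 6 (arms) + 76 (face) = 42 + 12 + 76 = 130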
In other words, the outputs of the MediaPipe framework (the hand, face, and pose landmarks) are being used as input to the more complex problem of sign language recognition.
Architecture Overview
It should be clear by now that these pipelines are implemented with two models, and can thus be called multi-inference pipelines. If we take the example of the palm detector and hand landmarks, we have the following dual-inference pipeline:
The pre-processing to this pipeline consists of image normalization, resizing, and padding, as shown below:
Note that the input image size can be different for the different pipelines (hand, face, pose).
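As a rough illustration, this pre-processing can be sketched in a few lines of Python; the 192x192 input size and the [-1, 1] normalization range used here are assumptions and differ between the hand, face, and pose models and between versions.
# Minimal sketch of the detector pre-processing (letterbox resize + normalization).
# The 192x192 input size and the [-1, 1] range are assumptions, not fixed values.
import cv2
import numpy as np

def preprocess(frame, input_size=192):
    h, w, _ = frame.shape
    scale = input_size / max(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    resized = cv2.resize(frame, (new_w, new_h))

    # Pad the shorter dimension to obtain a square input
    padded = np.zeros((input_size, input_size, 3), dtype=np.uint8)
    pad_y, pad_x = (input_size - new_h) // 2, (input_size - new_w) // 2
    padded[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized

    # Normalize to [-1, 1] and add the batch dimension expected by TF-Lite
    tensor = padded.astype(np.float32) / 127.5 - 1.0
    return np.expand_dims(tensor, axis=0), scale, (pad_x, pad_y)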
The first model, the detector, returns regions of interest (ROIs) as well as several keypoints. The post-processing for the first model is shown below:
The output of the sigmoid function can be seen in the following debug window, along with the minimum score threshold used to extract the initial regions of interest corresponding to the pre-defined anchor boxes.
The minimum score threshold determines which bounding boxes are retained. These candidate bounding boxes are combined with their pre-defined anchor boxes to determine the real coordinates of the bounding boxes, together with corresponding keypoints.
The remaining bounding boxes then go through a non-maximum suppression (NMS) algorithm, which eliminates duplicate overlapping ROIs.
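A simplified sketch of these post-processing steps is shown below. The box decoding is reduced to its essentials (the actual models also scale the offsets by the anchor dimensions and decode keypoints), so treat it as an illustration rather than the exact MediaPipe algorithm.
# Simplified sketch of the detector post-processing: sigmoid, score threshold,
# anchor-based box decoding, and non-maximum suppression. Tensor layouts and the
# decoding details are simplified assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_boxes(raw_boxes, anchors, input_size=192.0):
    # Offsets are predicted relative to pre-defined anchor centers (simplified)
    boxes = np.zeros_like(raw_boxes)
    cx = raw_boxes[:, 0] / input_size + anchors[:, 0]
    cy = raw_boxes[:, 1] / input_size + anchors[:, 1]
    w = raw_boxes[:, 2] / input_size
    h = raw_boxes[:, 3] / input_size
    boxes[:, 0], boxes[:, 1] = cx - w / 2, cy - h / 2   # xmin, ymin
    boxes[:, 2], boxes[:, 3] = cx + w / 2, cy + h / 2   # xmax, ymax
    return boxes

def non_max_suppression(boxes, scores, iou_thresh=0.3):
    order, keep = np.argsort(-scores), []
    while len(order):
        i = order[0]
        keep.append(i)
        # Intersection-over-union of the best box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-6)
        order = order[1:][iou < iou_thresh]   # drop overlapping duplicates
    return boxes[keep]

def postprocess_detections(raw_scores, raw_boxes, anchors, score_thresh=0.5):
    scores = sigmoid(raw_scores)              # raw logits -> probabilities
    keep = scores > score_thresh              # minimum score threshold
    boxes = decode_boxes(raw_boxes[keep], anchors[keep])
    return non_max_suppression(boxes, scores[keep])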
The pre-processing for the second (landmark) model is a cropping operation implemented with an affine transform (warp), as shown below:
The detector keypoints are used to determine the angle and scale of the extracted ROIs, such that their content is positioned upright. In order to better visualize how this is done, refer to the following illustrations:
A similar concept is also used by the face and pose pipelines, although the calculations are different. As an example, the following illustrations demonstrate how this is done for the face pipeline:
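The following sketch shows one way to implement this ROI extraction with OpenCV. The choice of keypoints (for example, wrist and middle-finger base for the hand pipeline), the scale factor, and the output size are assumptions here and differ per pipeline.
# Sketch of the ROI extraction: rotate and scale the detected region so that its
# content is presented upright, at a fixed size, to the landmark model.
# The keypoint pair, scale factor, and output size are pipeline-specific assumptions.
import cv2
import numpy as np

def extract_roi(frame, kp0, kp1, out_size=224, scale=2.6):
    # Angle that rotates the object upright (kp0 -> kp1 pointing "up")
    dx, dy = kp1[0] - kp0[0], kp1[1] - kp0[1]
    angle = np.degrees(np.arctan2(dy, dx)) + 90.0

    # Square box centered on the object, enlarged by the scale factor
    center = ((kp0[0] + kp1[0]) / 2.0, (kp0[1] + kp1[1]) / 2.0)
    box_size = scale * np.hypot(dx, dy)

    # Rotate about the center, then shift so the center lands in the middle of the crop
    matrix = cv2.getRotationMatrix2D(center, angle, out_size / box_size)
    matrix[0, 2] += out_size / 2.0 - center[0]
    matrix[1, 2] += out_size / 2.0 - center[1]
    return cv2.warpAffine(frame, matrix, (out_size, out_size)), matrix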
The post-processing for the landmark model is a de-normalization back to the real coordinates of the input image:
The final hand landmarks are illustrated in the figures below:
The landmarks for the other models are illustrated in the figures below:
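Going back to the de-normalization step, a minimal sketch is shown below, assuming the landmarks are produced in normalized crop coordinates and that the affine matrix from the ROI extraction step is available; both assumptions vary between model versions.
# Sketch of the landmark de-normalization: map landmarks from the (assumed)
# normalized crop coordinates back to the original image through the inverse of
# the affine transform used for cropping.
import cv2
import numpy as np

def denormalize_landmarks(norm_landmarks, matrix, crop_size=224):
    # Normalized crop coordinates -> pixel coordinates inside the crop
    pts = np.asarray(norm_landmarks, dtype=np.float32)[:, :2] * crop_size

    # Invert the crop's affine transform and map back to the input image
    inv = cv2.invertAffineTransform(matrix)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1), dtype=np.float32)])
    return pts_h @ inv.T   # (N, 2) landmark coordinates in the original image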
Before attempting to deploy the MediaPipe models to embedded platforms, I took it upon myself to create a pure-Python implementation of the pre-processing and post-processing, explicitly running inference with the individual TF-Lite models.
Take the time to appreciate the details captured in the above video:
- the top three windows illustrate overall detection and landmarks for hand, face, and pose
- the bottom three "debug" windows illustrate the regions of interest (ROIs) that are extracted by the detector model and passed to the landmark model
The ROIs are extracted such that the hands, face, and even human pose are presented in the same orientation and size to the landmark model.
The following diagram illustrates the main components of the "blaze_app" python application, as well as the main components of the dual-inference pipelines. Although the "extract ROI" step is part of the landmark pre-processing, it is identified separately for profiling purposes.
The "blaze_app" python application provides a common base for performing the following experiments during my exploration:
- Compare output from different versions of same models (0.07 versus 0.10)
- Compare output from alternate versions of the models (PyTorch, …) with the reference versions (TFLite)
- Identify and Profile latency of each component (model inference time, pre-processing, post-processing, etc…)
- View intermediate results (ROIs, detection scores wrt threshold, etc…)
The python application can be accessed from the following GitHub repository:
git clone https://github.com/AlbertaBeef/blaze_app_python
cd blaze_app_python
In order to successfully use the python demo with the original TFLite models, they need to be downloaded from the Google website:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
You are all set!
Launching the python application
The python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_tflite | {others not listed for now}
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the python script as follows:
python blaze_detect_live.py --list
...
[INFO] blaze_tflite supported ...
...
Command line options:
--input :
--image : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_pytorch,blaze_vitisai,blaze_hailo
--pipeline : all
--list : True
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
00 tfl_hand_v0_07 blaze_tflite/models/palm_detection_without_custom_op.tflite
blaze_tflite/models/hand_landmark_v0_07.tflite
01 tfl_hand_v0_10_lite blaze_tflite/models/palm_detection_lite.tflite
blaze_tflite/models/hand_landmark_lite.tflite
02 tfl_hand_v0_10_full blaze_tflite/models/palm_detection_full.tflite
blaze_tflite/models/hand_landmark_full.tflite
...
In order to launch the TFLite pipelines (--target=blaze_tflite) for hand detection and landmarks (--blaze=hand), use the python script as follows:
python3 blaze_detect_live.py --target=blaze_tflite --blaze=hand
This will launch the 0.07 version of the models, as well as the lite and full 0.10 versions, as shown below:
Notice the differences in accuracy between the 0.07 and 0.10 versions, specifically on my left hand.
To launch the full 0.10 version only, with the debug window enabled, and profiling, launch the python script as follows:
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_10_full --debug --profileview
This will launch the full 0.10 version of the hand pipeline, as well as a debug window and two profiling windows (latency & performance), as shown below:
The "--profilelog" argument can be used to log profiling information to a csv file for further processing. The following displays the average latency of all components in the pipeline for three versions of the hand landmarks pipeline:
- version 0.07
- version 0.10 (full)
- version 0.10 (lite)
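For example, such a log could be summarized with a few lines of pandas. Note that the file name and column names used below ('pipeline', 'component', 'latency') are assumptions about the log format, not the actual schema.
# Summarize average per-component latency from a --profilelog CSV file.
# File name and column names are assumed for illustration only.
import pandas as pd

log = pd.read_csv("blaze_detect_live_profile.csv")
summary = log.groupby(["pipeline", "component"])["latency"].mean().unstack()
print(summary)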
NOTE: the profiling feature of this single-threaded python application should not be used as a reliable benchmarking tool.
As can be seen in the above examples, this python script is meant to be a test bench for validating various implementations of the MediaPipe models, including:
- alternate frameworks (such as PyTorch)
- embedded inference (such as Vitis-AI, Hailo-8, etc...)
These examples will be explored in future projects...
How Fast is Fast?
On modern computers, the MediaPipe framework runs VERY fast!
On embedded platforms, however, things quickly break down, resulting in one frame per second in some cases...
The following benchmarks were gathered with my single-threaded python-only implementation in order to identify where time is being spent for different components on various platforms:
On an embedded ARM Cortex-A53 processor, the Palm Detection and Hand Landmark models run 3x slower than on a modern laptop, and 20x slower than on a modern workstation!
This is a case where the models cannot be used out of the box; they need to be further accelerated. But once again, this is a subject for another project...
Conclusion
I hope this project will inspire you to implement your own custom application.
What applications would you like to see built on top of these foundational MediaPipe models?
Do you have an interest in deploying the MediaPipe models to embedded platforms? If yes, which one?
Are you aware of datasets that could be used to re-train any of the hand/face/pose models?
Let me know in the comments...
Acknowledgements
I want to thank Google (https://www.hackster.io/google) for making the following available publicly:
- MediaPipe
- 2024/08/04 - Initial Version
- [Google] MediaPipe Solutions Guide : https://ai.google.dev/edge/mediapipe/solutions/guide
- [Google] MediaPipe Source Code : https://github.com/google-ai-edge/mediapipe
- [Google] SignALL SDK : https://developers.googleblog.com/en/signall-sdk-sign-language-interface-using-mediapipe-is-now-available-for-developers/
- [Kaggle] Isolated Sign Language Recognition : https://www.kaggle.com/competitions/asl-signs/overview
- [Kaggle] Fingerspelling Recognition : https://www.kaggle.com/competitions/asl-fingerspelling
- [AlbertaBeef] blaze_app_python : https://github.com/AlbertaBeef/blaze_app_python