This is the first project of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
I will start by giving a general overview of the MediaPipe models, and some of the challenges that can be expected when running on embedded hardware.
Then I will dive into the architecture of its most popular models (detection and landmarks for hands, face, and body pose).
Finally, I will introduce a custom python application that I created as a tool for running, debugging, and profiling alternate implementations of these models.
Status of the MediaPipe models
MediaPipe is a framework created by Google that implements detection and landmark estimation for the following applications:
- palm detection and hand landmarks
- face detection and landmarks
- pose detection and landmarks
- etc...
The MediaPipe framework supports/requires the following tools:
- TF-Lite (models)
- Bazel (build tools)
The first public versions of the MediaPipe models appeared in 2019, which can be considered a very long time ago with respect to AI/ML's rapid evolution.
The latest version of the MediaPipe framework, as of this writing, is v0.10.xx, which is still considered a development version.
Since its first public release, Google has made continual improvements to its MediaPipe framework.
One of the most notable improvements, documented in 2021, was better hand landmark accuracy for individual fingers, which is especially important for sign language applications.
In the latest version of the framework (0.10.14), Google introduced a new MediaPipe LLM Inference API enabling select large language models (Gemma 2B, Falcon 1B, Phi 2, and Stable LM 3B).
It is interesting that Google has never published a v1.00 release, which would be considered stable... most probably because the models and framework are continually being improved.
Challenges with the MediaPipe models
If you are using the MediaPipe framework on a modern computer, all that is required is a "pip install mediapipe", and everything works out of the box!
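For example, the legacy "solutions" API can be used in just a handful of lines (the image file name below is illustrative):
# Minimal out-of-the-box example using the legacy "solutions" API.
# The image file name is illustrative.
import cv2
import mediapipe as mp

image = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)
with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    results = hands.process(image)
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Each hand has 21 landmarks with normalized x, y, z coordinates
            print(hand_landmarks.landmark[0])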
Re-building the MediaPipe framework for a custom target, however, can be a challenge if you are not familiar with the Bazel build tools.
Re-training the models is not possible, since the dataset(s) used to train them were never published by Google.
Deployment to specialized targets usually requires at least a subset of the training dataset (for calibration, for example), which presents the challenge of coming up with this data yourself.
Furthermore, many targets do not support the TF-Lite framework, so the models need to be converted to an alternate framework prior to deployment.
One example of this is OpenVINO, where the models are first converted to TensorFlow2 using PINTO's model conversion utilities, before being fed into the OpenVINO framework.
Another example I ran into was deploying to Vitis-AI, where I had to work with models that had been converted to PyTorch. But that is a subject for another project...
Still Relevant Today
In 2023, Google launched two open-source competitions on Kaggle, totaling $300,000 in prizes.
Isolated Sign Language Recognition
- https://www.kaggle.com/competitions/asl-signs/overview
- Feb-May 2023
- $100,000 Prize
Fingerspelling Recognition
- https://www.kaggle.com/competitions/asl-fingerspelling
- May-Aug 2023
- $200,000 Prize
It is very interesting to analyze the winning solution for the Fingerspelling Recognition competition:
The input to the solution is the following subset of 130 landmarks:
- 21 key points from each hand
- 6 pose key points from each arm
- 76 from the face (lips, nose, eyes)
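For reference, these counts do add up to the 130 input landmarks:
2 x 21 (hands) + 2 x 6 (arms) + 76 (face) = 42 + 12 + 76 = 130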
In other words, the outputs of the MediaPipe framework (the hand, face, and pose landmarks) are being used as input to the more complex problem of sign language recognition.
Architecture Overview
It should be clear by now that these pipelines are implemented with two models, and can thus be called multi-inference pipelines. If we take the example of the palm detector and hand landmarks, we have the following dual-inference pipeline:
The pre-processing to this pipeline consists of image normalization, resizing, and padding, as shown below:
Note that the input image size can be different for the different pipelines (hand, face, pose).
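As a rough illustration, this pre-processing can be sketched in a few lines of Python; the 192x192 input size and the [-1, 1] normalization range used here are assumptions and differ between the hand, face, and pose models and between versions.
# Minimal sketch of the detector pre-processing (letterbox resize + normalization).
# The 192x192 input size and the [-1, 1] range are assumptions, not fixed values.
import cv2
import numpy as np

def preprocess(frame, input_size=192):
    h, w, _ = frame.shape
    scale = input_size / max(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    resized = cv2.resize(frame, (new_w, new_h))

    # Pad the shorter dimension to obtain a square input
    padded = np.zeros((input_size, input_size, 3), dtype=np.uint8)
    pad_y, pad_x = (input_size - new_h) // 2, (input_size - new_w) // 2
    padded[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized

    # Normalize to [-1, 1] and add the batch dimension expected by TF-Lite
    tensor = padded.astype(np.float32) / 127.5 - 1.0
    return np.expand_dims(tensor, axis=0), scale, (pad_x, pad_y)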
The first model, the detector, returns regions of interest (ROIs) as well as several keypoints. The post-processing for the first model is shown below:
The output of the sigmoid function can be seen in the following debug window, along with the minimum score threshold used to extract the initial regions of interest corresponding to the pre-defined anchor boxes.
The minimum score threshold determines which bounding boxes are retained. These candidate bounding boxes are combined with their pre-defined anchor boxes to determine the real coordinates of the bounding boxes, together with corresponding keypoints.
The remaining bounding boxes then go through a non-maximum suppression (NMS) algorithm, which eliminates duplicate overlapping ROIs.
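A simplified sketch of these post-processing steps is shown below. The box decoding is reduced to its essentials (the actual models also scale the offsets by the anchor dimensions and decode keypoints), so treat it as an illustration rather than the exact MediaPipe algorithm.
# Simplified sketch of the detector post-processing: sigmoid, score threshold,
# anchor-based box decoding, and non-maximum suppression. Tensor layouts and the
# decoding details are simplified assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_boxes(raw_boxes, anchors, input_size=192.0):
    # Offsets are predicted relative to pre-defined anchor centers (simplified)
    boxes = np.zeros_like(raw_boxes)
    cx = raw_boxes[:, 0] / input_size + anchors[:, 0]
    cy = raw_boxes[:, 1] / input_size + anchors[:, 1]
    w = raw_boxes[:, 2] / input_size
    h = raw_boxes[:, 3] / input_size
    boxes[:, 0], boxes[:, 1] = cx - w / 2, cy - h / 2   # xmin, ymin
    boxes[:, 2], boxes[:, 3] = cx + w / 2, cy + h / 2   # xmax, ymax
    return boxes

def non_max_suppression(boxes, scores, iou_thresh=0.3):
    order, keep = np.argsort(-scores), []
    while len(order):
        i = order[0]
        keep.append(i)
        # Intersection-over-union of the best box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-6)
        order = order[1:][iou < iou_thresh]   # drop overlapping duplicates
    return boxes[keep]

def postprocess_detections(raw_scores, raw_boxes, anchors, score_thresh=0.5):
    scores = sigmoid(raw_scores)              # raw logits -> probabilities
    keep = scores > score_thresh              # minimum score threshold
    boxes = decode_boxes(raw_boxes[keep], anchors[keep])
    return non_max_suppression(boxes, scores[keep])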
The pre-processing for the second (landmark) model is a cropping operation implemented with an affine transform (warp), as shown below:
The detector keypoints are used to determine the angle and scale of the extracted ROIs, such that their content is positioned upright. In order to better visualize how this is done, refer to the following illustrations:
A similar concept is also used by the face and pose pipelines, although the calculations are different. As an example, the following illustrations demonstrate how this is done for the face pipeline:
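The following sketch shows one way to implement this ROI extraction with OpenCV. The choice of keypoints (for example, wrist and middle-finger base for the hand pipeline), the scale factor, and the output size are assumptions here and differ per pipeline.
# Sketch of the ROI extraction: rotate and scale the detected region so that its
# content is presented upright, at a fixed size, to the landmark model.
# The keypoint pair, scale factor, and output size are pipeline-specific assumptions.
import cv2
import numpy as np

def extract_roi(frame, kp0, kp1, out_size=224, scale=2.6):
    # Angle that rotates the object upright (kp0 -> kp1 pointing "up")
    dx, dy = kp1[0] - kp0[0], kp1[1] - kp0[1]
    angle = np.degrees(np.arctan2(dy, dx)) + 90.0

    # Square box centered on the object, enlarged by the scale factor
    center = ((kp0[0] + kp1[0]) / 2.0, (kp0[1] + kp1[1]) / 2.0)
    box_size = scale * np.hypot(dx, dy)

    # Rotate about the center, then shift so the center lands in the middle of the crop
    matrix = cv2.getRotationMatrix2D(center, angle, out_size / box_size)
    matrix[0, 2] += out_size / 2.0 - center[0]
    matrix[1, 2] += out_size / 2.0 - center[1]
    return cv2.warpAffine(frame, matrix, (out_size, out_size)), matrix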
The post-processing for the landmark model is a de-normalization back to the real coordinates of the input image:
The final hand landmarks are illustrated in the figures below:
The landmarks for the other models are illustrated in the figures below:
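Going back to the de-normalization step, a minimal sketch is shown below, assuming the landmarks are produced in normalized crop coordinates and that the affine matrix from the ROI extraction step is available; both assumptions vary between model versions.
# Sketch of the landmark de-normalization: map landmarks from the (assumed)
# normalized crop coordinates back to the original image through the inverse of
# the affine transform used for cropping.
import cv2
import numpy as np

def denormalize_landmarks(norm_landmarks, matrix, crop_size=224):
    # Normalized crop coordinates -> pixel coordinates inside the crop
    pts = np.asarray(norm_landmarks, dtype=np.float32)[:, :2] * crop_size

    # Invert the crop's affine transform and map back to the input image
    inv = cv2.invertAffineTransform(matrix)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1), dtype=np.float32)])
    return pts_h @ inv.T   # (N, 2) landmark coordinates in the original image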
Before attempting to deploy the MediaPipe models to embedded platforms, I took it upon myself to create a pure-Python implementation of the pre-processing and post-processing, explicitly running inference with the individual TF-Lite models.
Take the time to appreciate the details captured in the above video:
- the top three windows illustrate overall detection and landmarks for hand, face, and pose
- the bottom three "debug" windows illustrate the regions of interest (ROIs) that are extracted by the detector model and passed to the landmark model
The ROIs are extracted such that the hands, face, and even human pose are presented in the same orientation and size to the landmark model.
The following diagram illustrates the main components of the "blaze_app" python application, as well as the main components of the dual-inference pipelines. Although the "extract ROI" step is part of the landmark pre-processing, it is identified separately for profiling purposes.
The "blaze_app" python application provides a common base for performing the following experiments during my exploration:
- Compare output from different versions of same models (0.07 versus 0.10)
- Compare output from alternate versions of the models (PyTorch, …) with the reference versions (TFLite)
- Identify and Profile latency of each component (model inference time, pre-processing, post-processing, etc…)
- View intermediate results (ROIs, detection scores wrt threshold, etc…)
The python application can be accessed from the following GitHub repository:
git clone https://github.com/AlbertaBeef/blaze_app_python
cd blaze_app_python
In order to successfully use the python demo with the original TFLite models, they need to be downloaded from the Google website:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
You are all set!
Launching the python application
The python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_tflite | {others not listed for now}
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the python script as follows:
python blaze_detect_live.py --list
...
[INFO] blaze_tflite supported ...
...
Command line options:
--input :
--image : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_pytorch,blaze_vitisai,blaze_hailo
--pipeline : all
--list : True
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
00 tfl_hand_v0_07 blaze_tflite/models/palm_detection_without_custom_op.tflite
blaze_tflite/models/hand_landmark_v0_07.tflite
01 tfl_hand_v0_10_lite blaze_tflite/models/palm_detection_lite.tflite
blaze_tflite/models/hand_landmark_lite.tflite
02 tfl_hand_v0_10_full blaze_tflite/models/palm_detection_full.tflite
blaze_tflite/models/hand_landmark_full.tflite
...
In order to launch the TFLite pipelines (--target=blaze_tflite) for hand detection and landmarks (--blaze=hand), use the python script as follows:
python3 blaze_detect_live.py --target=blaze_tflite --blaze=hand
This will launch the 0.07 version of the models, as well as the lite and full 0.10 versions, as shown below:
Notice the differences in accuracy between the 0.07 and 0.10 versions, specifically on my left hand.
To launch the full 0.10 version only, with the debug window enabled, and profiling, launch the python script as follows:
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_10_full --debug --profileview
This will launch the full 0.10 version of the hand pipeline, as well as a debug window and two profiling windows (latency & performance), as shown below:
The "--profilelog" argument can be used to log profiling information to a csv file for further processing. The following displays the average latency of all components in the pipeline for three versions of the hand landmarks pipeline:
- version 0.07
- version 0.10 (full)
- version 0.10 (lite)
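For example, such a log could be summarized with a few lines of pandas. Note that the file name and column names used below ('pipeline', 'component', 'latency') are assumptions about the log format, not the actual schema.
# Summarize average per-component latency from a --profilelog CSV file.
# File name and column names are assumed for illustration only.
import pandas as pd

log = pd.read_csv("blaze_detect_live_profile.csv")
summary = log.groupby(["pipeline", "component"])["latency"].mean().unstack()
print(summary)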
NOTE: the profiling feature of this single-threaded python application should not be used as a reliable benchmarking tool.
As can be seen in the above examples, this python script is meant to be a test bench for validating various implementations of the MediaPipe models, including:
- alternate frameworks (such as PyTorch)
- embedded inference (such as Vitis-AI, Hailo-8, etc...)
These examples will be explored in future projects...
How Fast is Fast?
On modern computers, the MediaPipe framework runs VERY fast!
On embedded platforms, however, things quickly break down, resulting in one frame per second in some cases...
The following benchmarks were gathered with my single-threaded python-only implementation in order to identify where time is being spent for different components on various platforms:
On an embedded ARM Cortex-A53 processor, the Palm Detection and Hand Landmark models run 3x slower than on a modern laptop, and 20x slower than on a modern workstation!
This is a case where the models cannot be used out of the box; they need to be further accelerated. But once again, this is a subject for another project...
Conclusion
I hope this project will inspire you to implement your own custom application.
What applications would you like to see built on top of these foundational MediaPipe models?
Do you have an interest in deploying the MediaPipe models to embedded platforms? If yes, which one?
Are you aware of datasets that could be used to re-train any of the hand/face/pose models?
Let me know in the comments...
Acknowledgements
I want to thank Google (https://www.hackster.io/google) for making the following available publicly:
- MediaPipe
- 2024/08/04 - Initial Version
- [Google] MediaPipe Solutions Guide : https://ai.google.dev/edge/mediapipe/solutions/guide
- [Google] MediaPipe Source Code : https://github.com/google-ai-edge/mediapipe
- [Google] SignALL SDK : https://developers.googleblog.com/en/signall-sdk-sign-language-interface-using-mediapipe-is-now-available-for-developers/
- [Kaggle] Isolated Sign Language Recognition : https://www.kaggle.com/competitions/asl-signs/overview
- [Kaggle] Fingerspelling Recognition : https://www.kaggle.com/competitions/asl-fingerspelling
- [AlbertaBeef] blaze_app_python : https://github.com/AlbertaBeef/blaze_app_python