Bringing Data-Centric AI to the Edge

This article describes a data-centric model improvement strategy for tinyML models.

In this, my first article on Hackster, I want to introduce the data-centric approach to tackling computer vision projects, explain its advantages, and describe how I helped create an automated data curation technique for large-scale image datasets.

Introduction

Edge AI, also known as tinyML, is the field linking AI applications to embedded devices and smart sensors. Unlike cloud-hosted AI systems, edge AI leverages resource-constrained devices to run specialized models that interact directly with edge nodes and sensors, performing real-time analytics while preserving data privacy.

The field of tinyML has been growing steadily in recent years. On one hand, it is becoming more reliable and accessible thanks to advances in hardware platforms that can handle ML workloads (Raspberry Pis, NVIDIA Jetsons, etc.) and in model architectures and machine learning frameworks, where models are getting smaller, more capable, and easier to run. Another crucial contributing factor is the emergence of open source projects and articles from experts and hobbyists around the world, something we can see clearly on the Hackster platform.

A major player in the edge AI space is the EDGE AI FOUNDATION (formerly the tinyML Foundation), which has been doing tremendous work in recent years, bringing together small and large companies, research groups, academics, and enthusiasts to create a global ecosystem of knowledge sharing, networking, collaboration, and contribution, and to promote edge AI adoption across all types of industries. In fact, they are partnering with Hackster on the On the Edge omni campaign to encourage this community to explore new hardware, tools, and challenges.

The EDGE AI FOUNDATION also plays an active role in encouraging tinyML enthusiasts to contribute to research by organizing the Wake Vision Challenge, an online competition where participants contributed solutions to help the researchers behind the Wake Vision Dataset, a large-scale computer vision dataset, find new techniques to improve their dataset and benchmarks.

The competition consisted of two tracks. In the model-centric track, participants had to provide a TensorFlow-based model architecture that performed best on Wake Vision's test set, both in accuracy and in latency when deployed on an STMicroelectronics MCU platform. The data-centric track, however, asked participants to create a data curation or pre-processing workflow that would train the best MCUNet model. That's where I come in.

I was the winner of the first edition of the data-centric track of the Wake Vision Challenge. The data curation solution I created helped me train a model that ranked first on the leaderboard and was selected by the jury as the best data-centric approach of that contest. Furthermore, the approach, which was designed to be scalable, was later used by the researchers behind the Wake Vision Dataset to improve its data quality, and it helped reduce the label error rate of the six-million-image dataset to a single digit. Not bad, innit?

In the remainder of this article, I will dive deeper into the Wake Vision Dataset, the challenge, and my solution, which you can also try yourself (with a few tweaks).

The Wake Vision Dataset

Led by the authors of the paper Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications, the dataset was put together to fill the gaps left by the few existing tinyML computer vision datasets. They constructed a large-scale dataset of over five million images dedicated to a common task: image classification of person versus non-person. The dataset can help develop generalized, lightweight, low-compute CV models that serve as a wake system for a more complex task.

The Wake Vision Dataset introduces a diverse set of images that help develop robust models. However, without thorough inspection, such datasets present several data quality flaws: wrong or missing annotations, redundant images, corrupt or noisy images, and a lack of diversity, to name a few. This is the case for the dataset at hand, because the authors mention that they only manually labeled the validation and test sets, a small subset of the whole, while the rest of the labels remained machine-generated.

This is where the data-centric track of the Wake Vision Challenge comes in. By engaging the EDGE AI FOUNDATION's community to inspect the dataset, mistakes and flaws can be spotted and corrected. It is a novel approach that mimics crowdsourced annotation, but instead of relying solely on tedious manual labeling, the goal was to come up with automated and scalable approaches that can improve the dataset.

In the next part of this article, I will detail my strategy that was selected as the winning solution and go over its implementation using openly available and free AI tools.

Data-centric AI: How to get model gains for real-world applications

Data-centric approach

The model-centric approach has driven much of the progress in AI, but it often reaches diminishing returns, especially in resource-constrained environments like tinyML, where models must fit within kilobytes of memory and run efficiently on microcontrollers.

The data-centric approach, by contrast, flips the script. Instead of constantly tweaking the model, we focus on improving the quality of the data it learns from. This means fixing mislabeled examples, balancing underrepresented cases, augmenting images to better capture real-world conditions, and filtering out noise. The model stays the same — but because the dataset is better, the model learns more effectively. A great real-world example of this is Tesla's Data Engine. Check out this presentation starting at 8:30 for extra insights.

My strategy

In every project I've worked on in my career, I have found myself tasked with collecting, analyzing, and filtering the right dataset for the task. As mentioned before, this is a crucial component of any AI project, and data engineers should have a thorough understanding of the datasets they have in hand.

My first step when tackling this challenge was to explore and understand the samples in Wake Vision, and I decided to start with a relatively small, randomly selected subset that could reveal the potential gaps. I identified two common yet impactful flaws in the dataset that I set out to tackle. The first was missing ground truth labels. The official CSV file containing the entirety of the Wake Vision labels was missing nearly 49,000 ground truth annotations, representing 25% of the subset I was working with. This causes one of two issues: either we leave untapped potential for leveraging these images to improve the model's generalizability, or, if they are fed into the ML training framework, they will all be treated as background images, i.e. as non-person images, which is not always the case.
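To make the first flaw concrete, here is a minimal sketch of how one could quantify missing ground truth in the label CSV with pandas. The file name and column names ("filename", "person") are assumptions for illustration; check the actual Wake Vision CSV schema before running it.

```python
import pandas as pd

# Load the official label CSV (file name and columns are assumed for illustration)
labels = pd.read_csv("wake_vision_train_labels.csv")

# Rows with a NaN label correspond to images with no usable ground truth
missing = labels[labels["person"].isna()]
print(
    f"{len(missing)} of {len(labels)} images have no ground truth label "
    f"({100 * len(missing) / len(labels):.1f}%)"
)
```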

The second issue is even more impactful: the existence of wrong labels. Using a clustering approach in the embedding space, I managed to detect clusters of images with faulty labels. The creators of the paper, having a limited budget, could only direct human and financial effort at labeling a portion of this massive dataset, leaving a huge margin for error. We can see an example in the image below, where a cluster of cat images is labeled as 'person' images. Wrong labels definitely hinder the performance of any model trained on this dataset.

These two major flaws in the dataset defined my strategy. My submitted solution was an automated workflow that directly tackles them. Using a set of open source software tools and powerful off-the-shelf computer vision models, I defined an automated, scalable, and reproducible label correction method that helped me win the competition by outscoring the original MCUNet model and the other participants. The beauty of it is that I only trained on the aforementioned subset of fewer than 200k images, which is roughly 4% of the entire Wake Vision dataset.

Tools and techniques

FiftyOne App

At the core of my solution, I relied on one of the few (if not the only) open source data analysis tools for computer vision: Voxel51's FiftyOne App for visual data analysis. It offers many perks, such as:

  • An easy-to-use WebUI and Python SDK for visualizing and exploring custom or open source image datasets.
  • A model zoo from which the user can apply multiple CV models to the dataset, with minimal to no extra dependencies.
  • Label analysis, clustering techniques, embedding and metadata computation, and more. It is designed to boost CV workflows for teams building visual AI applications.

I used the app mainly to:

  • Load the downloaded data folder and the ground truth CSV file containing the labels.
  • Visualize both images and their corresponding ground truth and perform an initial manual and visual inspection of the dataset.
  • Apply models' predictions directly within the tool without dependencies.
  • Compute embeddings and create an explorable 2D embedding space for data analysis (a minimal sketch of this workflow follows the list).
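Below is a minimal sketch (not my exact competition code) of how such a workflow can look with FiftyOne: loading the image folder, attaching labels from the CSV, computing CLIP embeddings with a UMAP projection, and launching the WebUI. Paths, field names, and the CSV schema are illustrative assumptions.

```python
import os

import fiftyone as fo
import fiftyone.brain as fob
import pandas as pd

# Load the downloaded image folder as a FiftyOne dataset
dataset = fo.Dataset.from_images_dir("wake_vision_subset/", name="wake-vision-subset")

# Attach ground truth labels from the official CSV (assumed columns: filename, person)
labels = pd.read_csv("wake_vision_train_labels.csv").set_index("filename")
for sample in dataset:
    fname = os.path.basename(sample.filepath)
    if fname in labels.index and pd.notna(labels.loc[fname, "person"]):
        label = "person" if labels.loc[fname, "person"] == 1 else "background"
        sample["ground_truth"] = fo.Classification(label=label)
        sample.save()

# Compute CLIP embeddings and a 2D UMAP projection for the embeddings panel
# (requires the umap-learn package)
fob.compute_visualization(
    dataset,
    model="clip-vit-base32-torch",
    method="umap",
    brain_key="clip_viz",
)

# Launch the WebUI to inspect images, labels, and the embedding space
session = fo.launch_app(dataset)
```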

Zero-shot predictions as data analytics tools

YOLOv11 from Ultralytics and CLIP from OpenAI are powerful pre-trained models that can generate accurate predictions for general tasks like detecting people, animals, or vehicles. Using FiftyOne's Python SDK, one can apply these models to the dataset and generate predictions; within a few minutes, the bounding box detections from YOLO and the image classifications from CLIP can be visualized in the WebUI and accessed in subsequent analysis workflows.
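As a rough sketch of what this looks like in code (reusing the dataset object from the previous snippet), the YOLO weights file and the label field names below are assumptions; adapt them to your setup.

```python
import fiftyone.zoo as foz
from ultralytics import YOLO

# YOLOv11 detections; FiftyOne's Ultralytics integration accepts the model directly
yolo = YOLO("yolo11n.pt")  # assumed weights file
dataset.apply_model(yolo, label_field="yolo11")

# CLIP zero-shot classification restricted to the two Wake Vision classes
clip = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=["person", "background"],
)
dataset.apply_model(clip, label_field="clip")
```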

I used the YOLOv11 object detector solely to generate bounding box detections for the 'person' class. Assuming the YOLO detections to be reliable, if an image contained at least one 'person' detection, I tagged it as a 'person' image; otherwise, it was tagged as a non-person image. As illustrated in the following image, the logic for correcting wrong ground truth labels consists of comparing these image tags against the original ground truth label: in short, we re-assign the ground truth to match the image tag. This results, for example, in flipping the ground truth of those cat images that came labeled as 'person' to background, since the YOLO model won't detect a person in such images.
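A minimal sketch of that correction logic, assuming the fields from the previous snippets and a hypothetical confidence threshold, could look like this:

```python
import fiftyone as fo

PERSON_CONF = 0.5  # assumed minimum YOLO confidence to trust a person detection

for sample in dataset:
    dets = sample["yolo11"].detections if sample["yolo11"] is not None else []
    has_person = any(
        d.label == "person" and d.confidence >= PERSON_CONF for d in dets
    )
    new_label = "person" if has_person else "background"

    old_label = sample["ground_truth"].label if sample["ground_truth"] else None
    if old_label is not None and old_label != new_label:
        # Re-assign the ground truth and tag the sample so corrections can be reviewed
        sample["ground_truth"] = fo.Classification(label=new_label)
        sample.tags.append("label_corrected")
        sample.save()
```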

After applying this workflow, I verified the correction results by re-inspecting some image clusters to double-check its effectiveness.

The models' predictions were also used to assign new ground truth to the unlabeled data, hence defining a semi-supervised labeling technique for the dataset. It reduces class ambiguity by comparing both models' outputs per image to verify the label: if the models agree, i.e. both detected the presence of a person, or, vice versa, neither detected a person, then a ground truth value of 'person' or 'background', respectively, is assigned to the image. This workflow was also automated, and it can be scaled over all unlabeled images in the Wake Vision Dataset, as well as used to label new incoming images in the event of an effort to grow the dataset.
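A sketch of this agreement-based labeling, again assuming the fields from the previous snippets, could look like the following. Samples where the two models disagree are simply left unlabeled for later review.

```python
import fiftyone as fo

# Select samples that still have no ground truth label
unlabeled = dataset.exists("ground_truth", False)

for sample in unlabeled:
    dets = sample["yolo11"].detections if sample["yolo11"] is not None else []
    yolo_says_person = any(d.label == "person" for d in dets)
    clip_says_person = sample["clip"] is not None and sample["clip"].label == "person"

    # Assign a label only when both models agree
    if yolo_says_person == clip_says_person:
        label = "person" if yolo_says_person else "background"
        sample["ground_truth"] = fo.Classification(label=label)
        sample.tags.append("pseudo_labeled")
        sample.save()
```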

The full implementation can be found in my GitHub repo.

Results

My submission was ranked first and selected as a winner by the competition's jury as it achieved the highest accuracy score on the private held-out test set.

To further showcase the value of my work, I kept my own test set when training my submitted model, which I used to benchmark my result against the openly available MCUNet trained on the original (unenhanced) Wake Vision Dataset. The accuracy jumped from 0.63 to 0.68. This means that, despite decreasing the number of images used for training, label correction had the more crucial effect, yielding performance improvements and more reliable models.

The solution I submitted was later scaled by the authors over the entirety of the dataset, leading to a reduction of the label error rate from 15.2% down to 9.8%.

Conclusion

The Wake Vision Challenge helped me prove that the future of embedded AI isn't just about shrinking neural networks or fitting more TOPS into smaller chips — it's also about being smarter with data. Focusing on data quality is essential to delivering production-grade ML systems that perform reliably in real-world scenarios.

This is an invitation for the Hackster community: as you tackle embedded AI challenges in your projects, remember that data quality isn’t optional—it’s essential. In the future, I will provide a deeper dive into the FiftyOne tool and how it can boost visual AI projects.

Last but not least, a huge shoutout to Jinger Zeng for encouraging me to share this!
