TinyML in a Parallel Universe

A new open source RISC-V embedded GPU brings powerful parallel processing capabilities to ultra-low-power devices for tinyML applications.

Nick Bild

GPUs are in many ways the big iron of the modern age. Without their massively parallel processing capabilities, the field of artificial intelligence (AI) would be decades behind where it is today. But while they are highly prized for their computing performance, cutting-edge GPUs are certainly not known for being energy-efficient or portable. They are quite at home in a large data center, slurping power while they crunch numbers inside a powerful machine, but the idea of fitting a GPU into a tinyML device sounds like total nonsense.

TinyML devices could really use some parallel processing of their own, however. Running AI algorithms in the cloud is only a stopgap measure — the goal has always been to run the algorithms exactly where they are needed, on-device. That may still be a bridge too far for the latest and greatest GPU technology, but researchers at the Swiss Federal Institute of Technology Lausanne (EPFL) think it is time to start dipping our toes in the water. They have developed an embedded GPU (e-GPU) that can bring serious parallel processing to the tiniest of hardware platforms.

Any solution that wants to operate in the world of tinyML needs to address a number of issues. First and foremost, energy consumption levels must be on par with the ultra-low-power host devices. Also, the footprint of an e-GPU needs to be small — very small — to be practical for embedded applications. And finally, a suitable programming framework is required for the platform. CUDA may not be an option, but there still must be some practical way to make use of the available hardware resources.

To solve these challenges, the team designed a configurable open source platform based on the RISC-V architecture. This tiny graphics processor was built with edge computing in mind — specifically for applications where space and power are strictly limited. Unlike traditional GPUs, which prioritize peak performance, the e-GPU platform focuses on balancing compute capability with energy and area efficiency.

Introduced alongside the e-GPU is Tiny-OpenCL, a lightweight GPU programming framework tailored for constrained environments. Traditional frameworks like OpenCL or CUDA assume features such as multithreading and file systems — luxuries that microcontrollers and embedded systems usually lack. Tiny-OpenCL strips away the bulk while preserving the core features needed to write and execute parallel programs on the e-GPU.
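The researchers' exact API isn't reproduced here, but assuming Tiny-OpenCL accepts standard OpenCL C kernel syntax (an assumption based on the name, not on published code), a minimal data-parallel kernel looks like this:

```c
/* Element-wise vector add in standard OpenCL C. This is a generic
 * sketch; it assumes Tiny-OpenCL accepts ordinary OpenCL C kernels. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out,
                      const unsigned int n)
{
    size_t i = get_global_id(0);  /* one work-item per array element */
    if (i < n)                    /* guard against padded launch sizes */
        out[i] = a[i] + b[i];
}
```

The one-work-item-per-element pattern is the point: the same kernel source can be spread across however many threads the e-GPU has, with no changes needed as the configuration scales up or down.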

To prove the viability of the concept, the team integrated the e-GPU with an X-HEEP microcontroller, creating what they call an Accelerated Processing Unit, specifically designed for tinyML tasks. Implemented using TSMC’s 16 nm CMOS technology and running at 300 MHz with a power budget of just 28 milliwatts, the prototype offers a real-world demonstration of what is possible with their ultra-efficient hardware design.
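The publication covers the hardware integration in depth; the sketch below is only meant to convey the general shape of host-to-accelerator dispatch on a microcontroller. Every register name and address here is hypothetical, not the actual X-HEEP or e-GPU interface:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register block for the e-GPU.
 * The real X-HEEP integration uses its own addresses and layout. */
#define EGPU_BASE        0x40020000u
#define EGPU_KERNEL_ADDR (*(volatile uint32_t *)(EGPU_BASE + 0x00))
#define EGPU_ARGS_ADDR   (*(volatile uint32_t *)(EGPU_BASE + 0x04))
#define EGPU_NUM_ITEMS   (*(volatile uint32_t *)(EGPU_BASE + 0x08))
#define EGPU_CTRL        (*(volatile uint32_t *)(EGPU_BASE + 0x0c))
#define EGPU_STATUS      (*(volatile uint32_t *)(EGPU_BASE + 0x10))

#define EGPU_CTRL_START  0x1u
#define EGPU_STAT_DONE   0x1u

/* Point the accelerator at a compiled kernel and its arguments,
 * start it, then spin until it reports completion. */
static void egpu_run(uint32_t kernel, uint32_t args, uint32_t n_items)
{
    EGPU_KERNEL_ADDR = kernel;
    EGPU_ARGS_ADDR   = args;
    EGPU_NUM_ITEMS   = n_items;
    EGPU_CTRL        = EGPU_CTRL_START;
    while ((EGPU_STATUS & EGPU_STAT_DONE) == 0)
        ;  /* busy-wait; an interrupt would be friendlier to a 28 mW budget */
}
```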

The system was evaluated using two key benchmarks: General Matrix Multiply, which measures scheduling overhead from Tiny-OpenCL, and a biosignal processing workload (TinyBio), which tests application-level performance and energy usage. The results were impressive — in high-end configurations with 16 threads, the e-GPU achieved up to a 15.1x speed-up compared to the baseline host CPU, while using only 2.5x more area and reducing energy consumption by up to 3.1x.
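For context, GEMM parallelizes naturally over output elements. The kernel below is a textbook version in the same OpenCL style as before, not the team's benchmark code:

```c
/* Naive GEMM, C = A * B, one work-item per output element.
 * A generic sketch; the benchmarked kernel is not published here. */
__kernel void gemm(__global const float *A,   /* M x K, row-major */
                   __global const float *B,   /* K x N, row-major */
                   __global float *C,         /* M x N, row-major */
                   const unsigned int M,
                   const unsigned int N,
                   const unsigned int K)
{
    size_t col = get_global_id(0);
    size_t row = get_global_id(1);
    if (row >= M || col >= N)
        return;

    float acc = 0.0f;
    for (unsigned int k = 0; k < K; k++)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}
```

Because each work-item is independent, throughput scales almost linearly with thread count, which squares with the near-linear 15.1x figure reported at 16 threads.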

By releasing the entire platform as open source, the researchers are inviting the broader community to experiment, adapt, and improve upon the design. Whether for lightweight AI inferences or other embedded applications, this e-GPU could be the first step toward bringing parallel processing to the tiniest of computing platforms.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.