Making a Split Decision
Researchers built a split learning testbed on ESP32-S3 boards, revealing how to run AI faster and smarter on tiny edge devices.
To reduce latency, enhance user privacy, and minimize energy use, the future of artificial intelligence (AI) will have to be more edge-based and decentralized. At present, most cutting-edge AI algorithms consume so many computational resources that they can only run on powerful hardware in the cloud. But as more and more use cases arise that do not fit this prevailing paradigm, efforts to optimize and shrink algorithms for on-device execution are picking up steam.
In an ideal world, any AI algorithm you might need would be perfectly comfortable running directly on the hardware that produces the data it analyses. But we are still a long way from that goal. Moreover, we cannot simply wait for major technological innovations to be achieved — we have needs that must be met now. For this reason, some compromises have to be made. We may not be able to run the algorithm we need entirely on a microcontroller, but perhaps with a boost from some nearby edge systems, we can make things work anyway.
That is the basic idea behind a technique called split learning (SL), in which a microcontroller executes the first few layers of a neural network, then transmits the results to a nearby machine that finishes the job. Because only the intermediate activations, which are generally not interpretable, ever leave the device, SL helps preserve privacy. And because the machines communicate over a local network rather than a distant cloud, latency stays low.
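To make the division of labor concrete, here is a minimal sketch of split inference in TensorFlow/Keras. The toy model, the conv_a/conv_b layer names, and the 96x96 input are purely illustrative, and in the actual testbed the first half runs on an ESP32 rather than in Python; the point is that the head produces intermediate activations, and only those bytes, never the raw image, travel to the tail.

```python
# Minimal split-inference sketch (illustrative, not the team's exact tooling):
# cut a small model at a chosen layer, run the "head" where the data lives,
# and ship only the intermediate activations to the "tail" for completion.
import numpy as np
import tensorflow as tf

# A toy classifier standing in for the real network.
full_model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, strides=2, activation="relu", name="conv_a"),
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu", name="conv_b"),
    tf.keras.layers.GlobalAveragePooling2D(name="pool"),
    tf.keras.layers.Dense(10, activation="softmax", name="classes"),
])

split_at = "conv_b"  # everything up to and including this layer runs on-device

# Head: raw image -> intermediate activations (microcontroller side).
head = tf.keras.Model(full_model.input, full_model.get_layer(split_at).output)

# Tail: intermediate activations -> class scores (nearby-machine side).
# Simple chaining works here because the layers after the split are sequential.
tail_in = tf.keras.Input(shape=head.output.shape[1:])
x = tail_in
split_index = full_model.layers.index(full_model.get_layer(split_at))
for layer in full_model.layers[split_index + 1:]:
    x = layer(x)
tail = tf.keras.Model(tail_in, x)

# Only the serialized activations cross the network, never the raw image.
image = np.random.rand(1, 96, 96, 3).astype("float32")
activations = head.predict(image, verbose=0)
payload = activations.tobytes()                      # what the device transmits
received = np.frombuffer(payload, dtype="float32").reshape(activations.shape)
scores = tail.predict(received, verbose=0)
print(f"transmitted {len(payload)} bytes, got {scores.shape[1]} class scores back")
```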
SL is still heavily experimental, however. How well does it work, and under what conditions? Which networking protocols are best suited to it? How much time can be saved? Comprehensive studies answering these questions are lacking, so a team at the Technical University of Braunschweig in Germany set out to get some answers. They designed an end-to-end TinyML and SL testbed built around ESP32-S3 microcontroller development boards and benchmarked a variety of solutions.
The researchers chose to implement their system using MobileNetV2, a compact image classification neural network architecture commonly used in mobile environments. To make the model small enough to run on ESP32 boards, they applied post-training quantization, reducing the model to 8-bit integers and splitting it at a layer called block_16_project_BN. This decision resulted in a manageable 5.66 KB intermediate tensor being passed between devices.
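A rough sketch of how such a head model might be produced with TensorFlow Lite's post-training quantization is shown below. The 96x96 input size, the converter settings, and the file name are assumptions for illustration; the team's exact export pipeline may differ.

```python
# Sketch: cut MobileNetV2 at block_16_project_BN and apply post-training
# full-integer (int8) quantization with the TFLite converter.
import numpy as np
import tensorflow as tf

INPUT_SIZE = 96  # assumed input resolution; the paper may use a different size

base = tf.keras.applications.MobileNetV2(
    input_shape=(INPUT_SIZE, INPUT_SIZE, 3), weights=None, include_top=True)

# Head model: everything from the input image up to the split layer.
head = tf.keras.Model(base.input, base.get_layer("block_16_project_BN").output)

def representative_data():
    # Calibration samples for quantization; in practice, real training images.
    for _ in range(100):
        yield [np.random.rand(1, INPUT_SIZE, INPUT_SIZE, 3).astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(head)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("head_int8.tflite", "wb") as f:
    f.write(converter.convert())

# The head's int8 output is the intermediate tensor that crosses the wireless
# link; its exact size depends on the chosen input resolution.
print("intermediate tensor shape:", head.output.shape)
```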
Four different wireless communication protocols were tested: UDP, TCP, ESP-NOW, and Bluetooth Low Energy (BLE). These protocols vary in terms of latency, energy efficiency, and infrastructure requirements. UDP showed excellent speed, achieving a round-trip time (RTT) of 5.8 seconds, while ESP-NOW outperformed all others with an RTT of 3.7 seconds, thanks to its direct, infrastructure-free communication model. BLE consumed the least energy but suffered the highest latency, stretching over 10 seconds due to its lower data throughput.
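The round-trip time here is essentially the interval between the first board sending its intermediate tensor and receiving the classification result back. As a purely host-side illustration of that measurement (the real benchmark runs as ESP32 firmware; the address, port, and payload size below are assumptions), a UDP version might look like this:

```python
# Illustrative round-trip-time measurement over UDP: send an intermediate
# tensor to the device running the second half of the model and time how
# long the reply takes to arrive.
import socket
import time

PEER = ("192.168.4.2", 5005)     # assumed address and port of the second device
PAYLOAD = bytes(5660)            # stand-in for the ~5.66 KB intermediate tensor

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(15.0)

start = time.monotonic()
sock.sendto(PAYLOAD, PEER)       # first device -> second: intermediate activations
reply, _ = sock.recvfrom(1024)   # second device -> first: classification result
rtt = time.monotonic() - start
print(f"round-trip time: {rtt:.2f} s")
```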
In all cases, the team used over-the-air firmware updates to remotely deploy their partitioned neural network models to the microcontrollers. The edge server, a desktop PC in this case, handled all training, splitting, quantization, and firmware generation tasks. Each part of the split model was compiled into a standalone Arduino firmware image and flashed onto different ESP32 devices. One board captured images from a connected camera and ran the first half of the model, while another completed the inference process.
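For the device-side halves, a common approach is to embed each quantized .tflite flatbuffer as a C array that a TensorFlow Lite Micro-style runtime can load from flash, similar in spirit to running xxd -i. The sketch below shows that step; the file and variable names are illustrative, and the team's actual firmware generator may work differently.

```python
# Sketch: embed a quantized .tflite model in a C header so it can be compiled
# into an Arduino/ESP32 firmware image.

def tflite_to_c_header(tflite_path: str, header_path: str, var_name: str) -> None:
    """Write the model bytes as a C array plus a length constant."""
    with open(tflite_path, "rb") as f:
        data = f.read()
    lines = [f"// Auto-generated from {tflite_path}",
             "#include <cstdint>",
             f"alignas(16) const uint8_t {var_name}[] = {{"]
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(data)};")
    with open(header_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# One header per device: the camera board gets the head of the network,
# the second board gets the tail.
tflite_to_c_header("head_int8.tflite", "head_model_data.h", "g_head_model")
tflite_to_c_header("tail_int8.tflite", "tail_model_data.h", "g_tail_model")
```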
Ultimately, no single solution is right for every application. But with benchmarks such as those produced in this work, we have the raw information we need to choose the right tool for each job.