Novel Benchmark Finds That Efficiency Wins Out Over Raw Power for On-Device TinyML
New benchmark aims to answer a big question: is it better to run your hardware full-throttle, or to scale things down?
Researchers from the Politecnico di Milano, working with STMicroelectronics and eyewear firm EssilorLuxottica, have come up with a new approach to benchmarking the efficiency of on-device tiny machine learning (tinyML) running on resource-constrained microcontrollers, one that takes into account both energy usage and latency.
"The rise of IoT [Internet of Things] has increased the need for on-edge machine learning, with tinyML emerging as a promising solution for resource-constrained devices such as MCU [Microcontroller Units]," the researchers write by way of introduction. "However, evaluating their performance remains challenging due to diverse architectures and application scenarios. Current solutions have many non-negligible limitations. This work introduces an alternative benchmarking methodology that integrates energy and latency measurements while distinguishing three execution phases: pre-inference, inference, and post-inference."
The aim: to address what the team claims are drawbacks of existing benchmarks, such as MLCommons' MLPerf Tiny, including a reliance on powering the device under test from specialized monitoring hardware rather than its usual supply, and a failure to separate the power draw of the energy-hungry inference stage from that of the work immediately before and after it.
To prove the concept, which splits the work into pre-inference, inference, and post-inference sections and uses a dual-trigger approach to separate each phase for more repeatable measurement, the team tested it on the STMicro STM32N6 microcontroller and its integrated neural coprocessor. The goal: to figure out whether it was better to run the chip at peak performance, and thus get the work done more quickly, or to tune it for higher energy efficiency at the cost of taking longer to complete a given workload.
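As a rough illustration of how a dual-trigger scheme of this kind might look in firmware, the sketch below raises one trigger line around the whole processing window and a second around network execution alone, so an external power analyzer can segment its trace into the three phases. The helper functions, phase bodies, and trigger names are hypothetical stand-ins rather than the team's actual code, and the triggers are stubbed with prints so the sketch compiles and runs on a host machine.

```c
#include <stdio.h>

/* Hypothetical trigger lines: on real hardware these would drive two GPIO
 * pins monitored by an external power analyzer. Stubbed here so the sketch
 * builds and runs without a board attached. */
static void trigger_window(int level)    { printf("WINDOW trigger -> %d\n", level); }
static void trigger_inference(int level) { printf("INFER  trigger -> %d\n", level); }

/* Placeholder workload stages; a real application would do sensor capture,
 * feature extraction, network execution, and result handling here. */
static void pre_process(void)  { /* e.g. windowing, normalization */ }
static void run_network(void)  { /* e.g. hand the tensor to the NPU */ }
static void post_process(void) { /* e.g. argmax, thresholding, I/O */ }

int main(void)
{
    /* Pre-inference: outer trigger high, inference trigger still low. */
    trigger_window(1);
    pre_process();

    /* Inference: the second trigger brackets network execution so its
     * energy can be isolated from the surrounding processing. */
    trigger_inference(1);
    run_network();
    trigger_inference(0);

    /* Post-inference: still inside the outer window. */
    post_process();
    trigger_window(0);

    return 0;
}
```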
"Our findings demonstrate that reducing the core voltage and clock frequency improve the efficiency of pre- and post-processing without significantly affecting network execution performance," the team concludes. "This approach can also be used for cross-platform comparisons to determine the most efficient inference platform and to quantify how pre- and post-processing overhead varies across different hardware implementations."
The team's work is available as a preprint on Cornell's arXiv server.