Over the last six months I’ve been looking at machine learning on the edge, publishing a series of articles trying to answer some of the questions that people have been asking about inferencing on embedded hardware.
But, after half a year of posts, talks, and videos, it’s all a bit of a sprawling mess, and the overall picture of what’s really happening is rather confusing.
So here’s a great big benchmarking roundup!
While some people have dismissed the idea of benchmarks for inferencing as irrelevant because “…it’s training times that matter,” that doesn’t really seem justified. If you take an academic approach to machine learning you will often train thousands of different models to find one that is ‘paper worthy,’ but this does not seem to be how things work out in the world.
Instead, for embedded systems, training is a sunk cost, with the final model being used thousands, perhaps even millions, of times depending on how many systems make use of it. Those models will also tend to hang around, potentially for decades if you’re talking about hardware that’s going into factories, homes, or public spaces. So in the long term it’s how fast those models run on the embedded hardware that’s important, not how long they took to train.
Discussion of the methodology behind the benchmarks can be found in the original post in the series, while the latest results can be found below, and are also discussed in both the first and the final post in the series.
While inferencing speed is probably our most important measure, these are devices intended to do machine learning at the edge. That means we also need to pay attention to environmental factors.
Designing a smart object isn’t just about the software you put on it. You also have to pay attention to other factors, and here we’re especially concerned with heating and cooling, and the power envelope, because it might be necessary to trade off inferencing speed against these other factors when designing for the Internet of Things.
Discussion of environmental factors like power consumption and heating and cooling can mostly be found in the original post in the series, although some discussion is scattered through the followup posts where warranted.
Getting Started with Google’s Edge TPU
The Coral Edge TPU-based hardware was found to be ‘best in class’ according to our benchmark results. With the addition of USB 3 to the Raspberry Pi 4, Model B, the Coral USB Accelerator is the fastest accelerator platform currently available.
Getting Started with Intel’s Movidius
While first to market, Intel’s Movidius-based hardware may now be showing its age. Most of the boards, cards, sticks, and other widgets you see advertising themselves as machine learning accelerators are actually based around Movidius hardware, but our benchmarks show poor performance when compared with Google’s newer Edge TPU hardware.
Getting Started with NVIDIA’s GPUs
A major takeaway from our benchmarking on the NVIDIA Jetson Nano dev kit is that if you want things to run quickly you need to optimise your TensorFlow model using NVIDIA’s own TensorRT framework.
We’ve also seen that while NVIDIA’s GPU-based hardware is more flexible, that extra capability comes with a speed penalty. NVIDIA’s board is built around their existing GPU technology, while Google’s Edge TPU hardware is aimed directly at running smaller quantised models. We’re seeing that the Edge TPU hardware is faster because running smaller models at the edge is what it’s designed to do.
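For readers who haven’t met the term, here is a rough sketch of what ‘quantised’ means in practice: floating point values are mapped onto 8-bit integers using a scale and a zero point. This is a generic, pure Python illustration of the idea. It is not the Edge TPU toolchain or TensorFlow’s converter, and the function names are made up for this example.

```python
from typing import List, Tuple


def quantise(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats onto unsigned 8-bit integers (illustrative affine scheme)."""
    # Widen the range to include zero so that 0.0 is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # One integer step covers `scale` units of the original range.
    # Fall back to 1.0 if every value is zero, to avoid dividing by zero.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale)
    # Round to the nearest step, shift by the zero point, clamp to 0..255.
    quantised = [
        max(0, min(255, round(v / scale) + zero_point)) for v in values
    ]
    return quantised, scale, zero_point


def dequantise(quantised: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantised]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up example values
    q, scale, zero_point = quantise(weights)
    print(q, scale, zero_point)
    print(dequantise(q, scale, zero_point))
```

Each value now takes one byte rather than four, at the cost of a small rounding error. That trade is the one hardware aimed at small quantised models is built around.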
Benchmarking Machine Learning
Our original benchmarks were run before the arrival of the Raspberry Pi 4, Model B. However, our main results were only reinforced by the arrival of the newer hardware, and they are really starting to make me wonder whether we’ve gone ahead and started optimising in hardware just a little too soon. The fast inferencing times we see with the Xnor.ai AI2GO framework, and the Edge TPU, both of which make use of quantisation, suggest that we may need to explore software strategies before continuing to optimise our hardware any further.
Machine Learning on the Raspberry Pi 4
Perhaps the biggest takeaway for those wishing to use the new Raspberry Pi 4 for inferencing is the performance gains seen with the Coral USB Accelerator. The addition of USB 3 to the Raspberry Pi 4 means we see an approximate ×3 increase in inferencing speed over our original results using the Raspberry Pi 3 and USB 2.
The somewhat surprising result of slower inferencing for the Raspberry Pi 4 and USB 2 is most likely due to the architectural changes made to the new Raspberry Pi.
But it’s not until we look at TensorFlow Lite on the Raspberry Pi 4 that we see the real surprise. Here we see between a ×3 and ×4 increase in inferencing speed between our original TensorFlow benchmark, and the new results using TensorFlow Lite.
Because the Raspberry Pi 4 needed to be actively cooled during testing, I’d recommend that if you intend to use the new board for inferencing for extended periods, you should add at least a passive heatsink. To avoid the possibility of CPU throttling entirely, though, adding a small fan is probably a good idea.
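If you want to keep an eye on this yourself during a long run, something like the following sketch would do. It assumes a Linux system such as Raspberry Pi OS that exposes the CPU temperature in millidegrees Celsius at the sysfs path shown. The 80°C limit and 5°C margin are assumptions for illustration rather than figures from the benchmarks, so check them against the documentation for your board.

```python
from pathlib import Path

# Assumed sysfs location of the CPU temperature, in millidegrees Celsius.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")

# Assumed throttling threshold in Celsius; verify for your own hardware.
THROTTLE_LIMIT_C = 80.0


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '48312\\n' into degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_cpu_temp(path: Path = THERMAL_PATH) -> float:
    """Read the current CPU temperature in degrees Celsius."""
    return parse_millidegrees(path.read_text())


def near_throttling(
    temp_c: float,
    limit_c: float = THROTTLE_LIMIT_C,
    margin_c: float = 5.0,
) -> bool:
    """Return True when the temperature is within margin_c of the limit."""
    return temp_c >= limit_c - margin_c


if __name__ == "__main__":
    temp = read_cpu_temp()
    print(f"CPU temperature: {temp:.1f} C")
    if near_throttling(temp):
        print("Close to the assumed throttling limit; consider more cooling.")
```

Logging this alongside your inferencing times makes it easier to tell whether a slow run was caused by the model or by the board getting too hot.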
The Benchmarking Code
If you’re interested in reproducing these results, or just want to get a much better understanding of my methodology, I’ve made all the resources you’ll need to run and duplicate the benchmark results available for download.
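To give a flavour of what timing an inference involves, here is a minimal harness. It is a sketch only, not the benchmark code from the series: the function names, the number of runs, and the warm-up step are all assumptions made for illustration, and `run_inference` stands in for whatever call invokes your model.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    run_inference: Callable[[], object],
    runs: int = 100,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time repeated calls to run_inference, returning stats in milliseconds."""
    # Untimed warm-up calls, so one-off setup costs don't skew the results.
    for _ in range(warmup):
        run_inference()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms) if runs > 1 else 0.0,
    }


def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster the new timing is than the baseline."""
    return baseline_ms / new_ms


if __name__ == "__main__":
    # Placeholder workload standing in for a real model invocation.
    stats = benchmark(lambda: sum(range(10_000)), runs=50)
    print(f"{stats['mean_ms']:.3f} ms +/- {stats['stdev_ms']:.3f} ms")
```

Comparing two platforms is then a matter of running the same model on each and dividing the mean times, which is where figures like a ×3 speed-up come from.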
The Great Big Roundup
The Raspberry Pi 4 is probably the most affordable and accessible way to get started with embedded machine learning right now. Use it on its own with TensorFlow Lite for competitive performance, or with the Coral USB Accelerator from Google for ‘best in class’ performance.