Benchmarking Machine Learning on the New Raspberry Pi 4, Model B

How much faster is the new Raspberry Pi? It’s a lot faster.

At the start of last month I sat down to benchmark the new generation of accelerator hardware intended to speed up machine learning inferencing on the edge. So I’d have a rough yardstick for comparison, I also ran the same benchmarks on the Raspberry Pi. Afterwards a lot of people complained that I should have been using TensorFlow Lite on the Raspberry Pi rather than full-blown TensorFlow. They were right: it ran a lot faster.

Then with the release of the AI2GO framework from Xnor.ai, which uses next generation binary weight models, I looked at the inferencing speeds of this next generation of models in comparison to ‘traditional’ TensorFlow. This also ran a lot faster.

The new Raspberry Pi 4, Model B. (📷: Alasdair Allan)

However with today’s launch of the new Raspberry Pi 4, Model B, it’s time to go back and look again at the benchmarks and see how much faster the new Raspberry Pi 4 is than the previous model. Spoiler? It’s a lot faster.

Headline Results From Benchmarking

Overall the new Raspberry Pi 4 is considerably faster than our original results from the Raspberry Pi 3, and the followup looking at the AI2GO platform.

Inferencing time in milliseconds for the Raspberry Pi 3 (blue, left) and Raspberry Pi 4 (green, right).

We see an approximate ×2 increase in inferencing speed between the original TensorFlow benchmarks and the new results from the Raspberry Pi 4, along with a similar increase in inferencing speed using the Xnor AI2GO platform.

Benchmarking results in milliseconds for the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300, for the Raspberry Pi 3, Model B+ (left), and the new Raspberry Pi 4, Model B (right).

However we see a much bigger change when looking at the results from the Coral USB Accelerator from Google. The addition of USB 3.0 to the Raspberry Pi 4 means we see an approximate ×3 increase in inferencing speed between our original results and the new results.

Conversely the inference times for the Coral USB Accelerator when it was connected via USB 2, rather than the new USB 3 bus, actually increased by a factor of ×2. This somewhat surprising result is likely due to the architectural changes made to the new Raspberry Pi.

“These results showcase both the increased NEON compute throughput of Raspberry Pi 4, and the benefit of including a pair of USB 3.0 ports in the design: we primarily intended these to be used to attach mass-storage devices, so it’s interesting to see another application in the wild.” — Eben Upton, Founder, Raspberry Pi Foundation

Part I — Benchmarking

A More In-Depth Analysis of the Results

Our original benchmarks were done using both TensorFlow and TensorFlow Lite on a Raspberry Pi 3, Model B+, and these were rerun using the new Raspberry Pi 4, Model B, with 4GB of RAM. Inferencing was carried out with the MobileNet v2 SSD and MobileNet v1 0.75 depth SSD models, both models trained on the Common Objects in Context (COCO) dataset. Benchmarks using the Coral USB Accelerator were similarly rerun with the accelerator dongle attached to both the USB 2 and USB 3 bus of the Raspberry Pi 4.

ℹ️ Information Our original benchmarks compared inferencing on the following platforms: the Coral Dev Board, the NVIDIA Jetson Nano, the Coral USB Accelerator with a Raspberry Pi 3, Model B+, the original Movidius Neural Compute Stick with a Raspberry Pi 3, Model B+, and the second generation Intel Neural Compute Stick 2, again with a Raspberry Pi 3, Model B+. Finally, as a yardstick, we ran the same models again on my Apple MacBook Pro (2016), which has a quad-core 2.9 GHz Intel Core i7, and on a vanilla Raspberry Pi 3, Model B+ without any acceleration.

The Xnor.ai AI2GO platform was benchmarked using their ‘medium’ Kitchen Object Detector model. This model is a binary weight network, and while the nature of the training dataset is not known, some technical papers around the model are available.

A single 3888×2916 pixel test image was used containing two recognisable objects in the frame, a banana🍌 and an apple🍎. The image was resized down to 300×300 pixels before presenting it to each model, and the model was run 10,000 times before an average inferencing time was taken. The first inferencing run, which takes longer due to loading overheads in the case of TensorFlow models, was discarded.
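The timing procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the actual benchmarking code: the `run_inference` callable is a hypothetical stand-in for a single model invocation on the resized image, and I have assumed that the discarded first run counts as one of the total runs.

```python
import time
from statistics import mean


def benchmark(run_inference, runs=10_000):
    """Time `runs` calls and return the mean in milliseconds.

    `run_inference` is a hypothetical zero-argument callable standing in
    for one inference pass. The first timing is dropped, since it includes
    one-off loading overheads.
    """
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # Discard the first (warm-up) run before averaging.
    return mean(timings_ms[1:])


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere, not a real model.
    print(f"{benchmark(lambda: sum(range(1000)), runs=100):.3f} ms")
```

Averaging over many runs smooths out scheduling noise, while dropping the first run keeps model loading time out of the reported figure.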

⚠️Warning While benchmarks were run for TensorFlow, AI2GO, and the Coral USB Accelerator, the update from Raspbian Stretch to Raspbian Buster necessary to support the board means that the installed Python version has moved from Python 3.5 to 3.7. This change meant that I was unable to run benchmarks for TensorFlow Lite, the Movidius Neural Compute Stick, or the Intel Neural Compute Stick 2. While the TensorFlow Lite problems are probably going to be resolvable fairly easily, moving the Intel OpenVINO framework from Python 3.5 to 3.7 will take some time to accomplish. You should therefore not expect the Intel Neural Compute Stick to work with the Raspberry Pi 4 in the near term.

Overall for CPU-based models we see a rough ×2 increase in performance.

With roughly twice the NEON capacity of the Raspberry Pi 3, we would expect this order of speedup for well-written NEON kernels. As expected, after thermal throttling issues were addressed, we saw a rough ×2 increase in performance for both the MobileNet v1 models and the Xnor.ai AI2GO framework.

The performance seen with the AI2GO platform’s binary weight models, with an observed inferencing time of 79.5 ms on an unaccelerated Raspberry Pi 4, is directly comparable with the MacBook Pro (2016), which had an inferencing time of 71 ms for MobileNet v2 SSD.

However the much smaller speedup we see for the MobileNet v2 models is intriguing, suggesting that the v2 model may be using very different TensorFlow operations, which are not well optimised for the architecture.

Inferencing time in milliseconds for the MobileNet v1 SSD 0.75 depth model (left hand bars) and the MobileNet v2 SSD model (right hand bars), both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300. The (single) bars for the Xnor AI2GO platform use their proprietary binary weight model. All measurements on the Raspberry Pi 3, Model B+, are in yellow, measurements on the Raspberry Pi 4, Model B, in red. Other platforms are in green.

While inferencing using TensorFlow Lite wasn’t carried out, due to the move from Python 3.5 to 3.7 breaking the Python wheel, I would also expect to see a rough ×2 speedup during inferencing for these models for the same reason.

However probably the biggest takeaway for those wishing to use the new Raspberry Pi 4 for inferencing is the performance gains seen with the Coral USB Accelerator. The addition of USB 3.0 to the Raspberry Pi 4 means we see an approximate ×3 increase in inferencing speed over our original results.

Benchmarking results in milliseconds for the Coral USB Accelerator using the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset, for the Raspberry Pi 3, Model B+ (left), and the Raspberry Pi 4, Model B over USB 3.0 (middle) and USB 2 (right).

That is a decrease in inferencing time from 49.3 ms down to 14.9 ms for the MobileNet v1 0.75 depth SSD model, and a decrease from 58.1 ms down to 18.2 ms for the MobileNet v2 SSD model. That actually brings the inferencing times for the Raspberry Pi 4 below those from the Coral Dev Board, which had times of 15.7 ms and 20.9 ms for the two models respectively.
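As a quick check on the ‘approximate ×3’ figure, the speedup can be worked out directly from the timings quoted above. The short sketch below uses only those four numbers; the `speedup` helper and the dictionary layout are my own naming, used purely for illustration.

```python
# Coral USB Accelerator inferencing times in ms, as quoted in the text:
# (Raspberry Pi 3 Model B+, Raspberry Pi 4 Model B over USB 3.0)
timings_ms = {
    "MobileNet v1 SSD 0.75 depth": (49.3, 14.9),
    "MobileNet v2 SSD": (58.1, 18.2),
}


def speedup(before_ms, after_ms):
    """Ratio of the old inferencing time to the new one."""
    return before_ms / after_ms


for model, (pi3_ms, pi4_ms) in timings_ms.items():
    print(f"{model}: x{speedup(pi3_ms, pi4_ms):.1f}")
```

Both models come out slightly above a factor of three, at roughly ×3.3 and ×3.2 respectively.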

Conversely, the inference times for the Coral USB Accelerator when it was connected via USB 2, rather than the new USB 3 bus, actually increased by a factor of ×2. This somewhat surprising result is most likely due to the architectural changes made to the new Raspberry Pi. With the XHCI host now at the far end of the PCI Express bus, there’s potentially much more latency in the system. Depending on the traffic pattern you could imagine that blocking, as opposed to streaming, use of the channel could well be slower.

ℹ️ Information While the pre-release board I was using had 4GB of RAM, it’s unlikely that this would significantly affect the results for the Coral USB Accelerator, where inferencing is done ‘off board’ on the Edge TPU itself. I would expect to see the same benchmark numbers, or at worst broadly similar ones, for Raspberry Pi 4 boards with 1GB or 2GB of RAM on board.

Environmental Factors

While inferencing speed is probably our most important measure, these are devices intended to do machine learning at the edge. That means we also need to pay attention to environmental factors. Designing a smart object isn’t just about the software you put on it; here we’re especially concerned with heating and cooling, and the power envelope, because it might be necessary to trade off inferencing speed against these other factors when designing for the Internet of Things.

Therefore, along with inferencing speed, when discussing edge computing devices it’s also important to ascertain the heat and power envelopes. So let’s go do that now.

Power Consumption

Current measurements were made using a multimeter in line with the USB cable, with a reported accuracy of ±0.01 A (10 mA).

Idle and peak current consumption for our benchmarked platforms before and during extended testing. All measurements for USB connected accelerated platforms were done using a Raspberry Pi 3, Model B+.

Except for the MacBook Pro, all of our platforms take a nominal 5V input supply. However in reality the voltage will bounce around somewhat due to demands made by the board, and most USB supplies actually sit at around +5.1 to +5.2V. So when doing rough calculations to get the power (in Watts) I’d normally take the voltage of a USB supply to be +5.15V, as a good supply will usually try to maintain the supplied voltage around this figure despite rapid fluctuations in current draw.

Those fluctuations in demand happen a lot when you’re using peripherals with the Raspberry Pi and often cause brownouts. They are something that a lot of USB chargers, designed to provide consistent current for charging cellphones, usually don’t cope with all that well. This is one of the reasons why the new Raspberry Pi 4 has transitioned from micro USB to the USB-C standard.

Idle current (in green, left hand bars) compared to peak current (in yellow, right hand bars).

During our previous benchmarking we saw that the Raspberry Pi 3, Model B+, was comparatively power hungry, with only the NVIDIA Jetson Nano needing a larger power envelope. Our new measurements show that the new Raspberry Pi 4 is the worst performer of the platforms, needing over 1,400mA at peak during extended testing. It also has the highest resting consumption, drawing more idle current than the Coral Dev Board.
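To turn these current readings into a rough power figure, the +5.15V rule of thumb described earlier in this section can be applied as a simple P = V × I calculation. The sketch below is illustrative only: it takes 1,400 mA as a round stand-in for the ‘over 1,400mA’ peak, and the function and constant names are my own.

```python
SUPPLY_VOLTAGE_V = 5.15    # assumed typical USB supply voltage, from the text
METER_ACCURACY_MA = 10.0   # reported accuracy of the inline meter (0.01 A)


def power_watts(current_ma, voltage_v=SUPPLY_VOLTAGE_V):
    """Rough power draw in watts for a current reading in milliamps."""
    return voltage_v * current_ma / 1000.0


peak_w = power_watts(1400)                    # Raspberry Pi 4 peak, rounded down
uncertainty_w = power_watts(METER_ACCURACY_MA)

print(f"Peak power: about {peak_w:.2f} W (+/- {uncertainty_w:.2f} W)")
```

Under these assumptions the Raspberry Pi 4 draws a little over 7 W at peak. The meter accuracy contributes only about 0.05 W of uncertainty, so the assumed supply voltage is the larger source of error.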

Heating and Cooling

In previous extended tests we saw Raspberry Pi temperatures approach, but not exceed, the 80°C point where thermal throttling of the CPU would occur during inferencing using TensorFlow and TensorFlow Lite models.

My initial results for the AI2GO benchmark gave an inferencing time of 90.9 ms, which was considerably higher than expected. However during these test runs we observed temperatures well above the thermal throttling threshold.

$ vcgencmd measure_temp
temp=84.0'C
$ vcgencmd measure_clock arm
frequency(48)=1000265600
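If you want to log these readings during a long benchmarking run, the vcgencmd output shown above is easy to parse. The sketch below is a minimal illustration that only handles the two output formats shown in this article; the function names are hypothetical, and on a real board you would feed it the captured output of the commands rather than the hard-coded sample strings.

```python
import re

THROTTLE_THRESHOLD_C = 80.0  # throttling point quoted in this article


def parse_temp_c(output):
    """Extract the temperature in Celsius from `vcgencmd measure_temp` output."""
    match = re.search(r"temp=([\d.]+)'C", output)
    if match is None:
        raise ValueError(f"unrecognised temperature output: {output!r}")
    return float(match.group(1))


def parse_clock_hz(output):
    """Extract the frequency in Hz from `vcgencmd measure_clock arm` output."""
    match = re.search(r"frequency\(\d+\)=(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognised clock output: {output!r}")
    return int(match.group(1))


# Sample strings copied from the terminal session above.
temp_c = parse_temp_c("temp=84.0'C")
clock_hz = parse_clock_hz("frequency(48)=1000265600")

print(f"CPU temperature: {temp_c:.1f} C, ARM clock: {clock_hz / 1e6:.0f} MHz")
if temp_c >= THROTTLE_THRESHOLD_C:
    print("Above the throttling threshold, timings are likely to be inflated")
```

For the readings above, the reported ARM clock works out at roughly 1.0 GHz, below the Raspberry Pi 4’s 1.5 GHz maximum, which is consistent with the CPU being throttled.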

The addition of a small fan, driven from the Raspberry Pi’s own GPIO headers, was sufficient to keep the CPU temperature stable at 45°C during testing.

A small fan was sufficient to keep the CPU temperature stable.

After stabilising the CPU temperature the inferencing time decreased, dropping from 90.9 ms down to 79.5 ms. This result is more in line with what we would expect given roughly twice the NEON capacity on the new board.

However, due to this need to actively cool the Raspberry Pi during testing, I’d recommend that if you intend to use the new board for inferencing for extended periods, you should add at least a passive heatsink. To avoid the possibility of CPU throttling entirely, though, a small fan is likely to be a good idea.

Because let’s face it, CPU throttling can spoil your day.

Summary

The performance increase seen with the new Raspberry Pi 4 makes it a very competitive platform for machine learning inferencing at the edge.

Benchmarks using the AI2GO platform and the binary weight network models show inferencing times competitive with the NVIDIA Jetson Nano using their TensorRT optimised TensorFlow models. However it is the addition of the USB 3.0 bus on the new board that makes it competitive not just on speed, but on price, with our previous ‘best in class’ board, the Coral Dev Board from Google.

Priced at $35 the 1GB version of the new Raspberry Pi 4 is significantly cheaper than the $149 Coral Dev Board. Adding an additional $74.99 for the Coral USB Accelerator to the price of the Raspberry Pi means that you can outperform the previous ‘best in class’ board for a cost of $109.99. That’s a saving of $39.01 over the cost of the Coral Dev Board, for better performance.
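The price comparison above is simple arithmetic, sketched below using only the prices quoted in this article. Decimal is used rather than floats so that the cents come out exactly; the constant names are my own.

```python
from decimal import Decimal

# Prices in US dollars, as quoted in the text.
PI4_1GB = Decimal("35.00")
CORAL_USB_ACCELERATOR = Decimal("74.99")
CORAL_DEV_BOARD = Decimal("149.00")

combined = PI4_1GB + CORAL_USB_ACCELERATOR
saving = CORAL_DEV_BOARD - combined

print(f"Raspberry Pi 4 + USB Accelerator: ${combined}")
print(f"Saving over the Coral Dev Board:  ${saving}")
```

Note that this leaves out the power supply, SD card, and any cooling you might add, so treat it as a comparison of board prices only.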

Part II — Methodology

Preparing the Raspberry Pi

Fortunately despite the differences between the new Raspberry Pi 4 and previous generations, installation of the supporting software we needed wasn’t too different. However, there were some hiccups along the way.

Installing the Coral Software

Unfortunately, as I was running the Coral Software Development Kit on a brand new Raspberry Pi board that was still secret, and that the team at Google hadn’t even heard about yet, I couldn’t install things as normal. Fortunately there were only some small fixable problems with the shipping install script. It’s likely that these problems will be quickly resolved. However until then you’ll need to make some tweaks before things will install and run.

Go ahead and download the software development kit using wget, and uncompress the bundle into your home directory.

$ wget http://storage.googleapis.com/cloud-iot-edge-pretrained-models/edgetpu_api.tar.gz
$ tar -xvzf edgetpu_api.tar.gz
$ cd python-tflite-source

But before running the installation script I had to make some changes. The install.sh script relies on yet another script called platform_recognizer.sh to figure out what platform the Coral SDK is being deployed into, and install the appropriate libraries. I went ahead and added the following lines,

elif [[ "$board_version" == "Raspberry Pi"* ]]; then
   platform="other_raspberry"
   echo -e "${GREEN}Recognised some other Raspberry Pi"

into the decision tree in the platform_recognizer.sh script, which means that the contents of the /proc/device-tree/model file,

$ cat /proc/device-tree/model
Raspberry Pi ? Rev 1.1

for my pre-release version of the hardware and software, were recognised. I then modified the install.sh script to accept this as a valid answer,

elif [[ "$platform" == "raspberry_pi_3b" ]] || [[ "$platform" == "raspberry_pi_3b+" ]] || [[ "$platform" == "other_raspberry" ]];then

in both places where the script checks for the platform, which are on lines 64 and 109. We’ll also need to split line 92 into two separate lines,

sudo udevadm control --reload-rules 
sudo udevadm trigger

due to some changes between Debian Stretch and Buster.

Finally the install script is expecting Python 3.5, and the newest version of Raspbian that shipped with the Raspberry Pi 4 is a flavour of Debian Buster which comes with Python 3.7. So you’ll also have to modify line 116 of the install.sh script, changing python3.5 to python3.7,

python3.7 setup.py develop --user

before you can run the installation script,

$ ./install.sh

which should now successfully complete.

Once the installation has completed, go ahead and plug in the USB Accelerator using the short USB-C to USB-A cable that accompanied the USB stick in the box. If you’ve already plugged it in, you’ll need to remove it and replug it, as the installation script adds some udev rules that allow software running on the Raspberry Pi to recognise that the Edge TPU hardware is present.

Installing TensorFlow

Installing TensorFlow on the Raspberry Pi used to be a difficult process; however, towards the middle of last year everything became a lot easier.

Unfortunately the officially released wheel has some problems with Python 3.7. In theory that means we’d either have to build TensorFlow and all its dependencies from source, or downgrade back to Python 3.5. Neither is a particularly pleasant thought, since the state of the Raspbian Buster package repository is still somewhat in flux during pre-release.

Fortunately Pete Warden came through with a candidate wheel for Python 3.7, and Ben Nuttall provided me with wheels for all the necessary dependencies. These will be made official soon, so it’s likely that by the time of release you will be able to install TensorFlow using the official method,

$ sudo apt-get install libatlas-base-dev
$ sudo apt-get install python3-pip
$ pip3 install tensorflow

but check this GitHub issue before proceeding to make sure that’s the case.

Installing AI2GO

The AI2GO platform installs and runs out of the box on the new Raspberry Pi 4, so you can just follow the instructions in the methodology section of my previous benchmarks using the Raspberry Pi 3 to configure and download your model bundle.

Once you’ve downloaded it, go ahead and install the model bundle,

$ cd ~/kitchen-object-detector-medium-300
$ pip3 install xnornet-1.0-cp35-abi3-linux_armv7l.whl
Processing ./xnornet-1.0-cp35-abi3-linux_armv7l.whl
Installing collected packages: xnornet
Successfully installed xnornet-1.0
$

although be aware that if you’ve previously installed another model bundle, you need to ensure you’ve uninstalled it first before installing a new one.

Problems with TensorFlow Lite

While the official TensorFlow binary distribution does not include a build of TensorFlow Lite, there is an unofficial distribution which does. Unfortunately this wheel has not been updated to support Raspbian Buster and Python 3.7. However it’s likely that situation will change after the new Raspberry Pi has been officially released, at which point I’ll probably go back and take another look at TensorFlow Lite.

Problems with the Neural Compute Stick

The software to support the Neural Compute Stick is the OpenVINO toolkit, and right now there is no support for running the toolkit under Python 3.7, which is what is shipped with Raspbian Buster. I unfortunately couldn’t perform benchmarking for the Movidius Neural Compute Stick, or the Intel Neural Compute Stick 2. Based on past performance it’s likely that updating the Raspberry Pi card image may take some time.

⚠️Warning You should not expect the Movidius Neural Compute Stick or the Intel Neural Compute Stick 2 to work with the Raspberry Pi 4 in the near term.

The Benchmarking Code

The code from our previous benchmarks was reused unchanged.

Benchmarking Edge Computing - All the resources needed to reproduce the benchmarking timing runs.

In Closing

Comparing these platforms on an even footing continues to be difficult. But it is clear that the new Raspberry Pi 4 is a solid platform for machine learning inferencing at the edge.

Links to Previous Benchmarks

If you’re interested in the details of the previous benchmarks, they can be found in the following articles.

Benchmarking Edge Computing - Comparing Google, Intel, and NVIDIA accelerator hardware

Benchmarking TensorFlow and TensorFlow Lite on the Raspberry Pi - I recently sat down to benchmark the new accelerator hardware that is now appearing on the market intended to speed up…

Benchmarking the Xnor AI2GO Platform on the Raspberry Pi - I recently sat down to benchmark the new accelerator hardware that is now appearing on the market intended to speed up…

Links to Getting Started Guides

If you’re interested in getting started with any of the accelerator hardware I used during my benchmarks, I’ve put together getting started guides for the Google, Intel, and NVIDIA hardware I looked at during the analysis.

Hands on with the Coral Dev Board - Getting started with Google’s new Edge TPU hardware

How to use a Raspberry Pi to flash new firmware onto the Coral Dev Board - Getting started with Google’s new Edge TPU hardware

Hands on with the Coral USB Accelerator - Getting started with Google’s new Edge TPU hardware

Getting Started with the Intel Neural Compute Stick 2 and the Raspberry Pi - Getting started with Intel’s Movidius hardware

Getting Started with the NVIDIA Jetson Nano Developer Kit - Getting started with NVIDIA’s GPU-based hardware

Alasdair Allan

Scientist, author, hacker, maker, and journalist. Building, breaking, and writing. For hire. You can reach me at 📫 alasdair@babilim.co.uk.