Larry Bank (The Performance Whisperer) Is on a Quest to Fix Our Wares Bit by Bit, Byte by Byte

Opportunities to optimize how your project performs are abundant; you just need to know where to look for them.

5 years ago

Our community grows larger and larger with every passing day.

And to be honest, that should be of little surprise! As hackers and makers, our remit covers some exceptionally cool things: robots, blinkenlights, IoT devices, and everything in between!

Technology advances at such a pace that gone are the days when gimbaled rockets were the realm of JPL alone. These days, bipeds built in backyard sheds are spooling up to take a run at Boston Dynamics.

Indeed, the barrier to entry for all that we do is dropping lower and lower every day.

One reason I might suggest for this is the advancement we've seen in integration. Days of yore would have had circuits carefully crafted from 74HC chips, and programs painstakingly punched in machine bytecode.

This is a far cry from the collection of items that sit on our desk today.

(📷: Octavo Systems)

An entire computer sits in a single chip, and Arduino makes getting an IoT-connected module running a mere matter of a few lines of code. Those lines sit on top of a hugely under-appreciated amount of integrated hardware and, equally if not more importantly, libraries, built on libraries, built on... you get the idea.

So perhaps it's safe to say that the average maker might not entirely appreciate the full stack of hardware and firmware that is being called upon when they write "Serial.println("This is a string!");"...

That's not specifically a bad thing in itself (heck, the number of successful, well-functioning projects backs that up for a start), but it does impose two limitations that should likely receive more attention than they do.

I'm going to spend some time in this article waxing lyrical about the various ideas behind "performance" in relation to our trade.

Hopefully, we'll throw some light on the bits that often get overlooked in the pursuit of the perfect project.

First up, not knowing what sits below all the high-level functions means that "light" users of an example library often have no idea how to address the errors that come to light when the compiler throws its toys out of the compilation pram with a cryptic warning. Eventually though, Googling the right combination of errors will lead them to that magic Stack Overflow answer, and all will be good in the world. We hope.

The second limitation of just using out-of-the-box library code is possibly considered less often, but it is just as important when it comes to successful, system-level design. Performance.

Performance is a term that gets thrown about a lot, usually in the context of marketing. It'll be emblazoned on the box of the latest GPU, or touted in the tagline for the next generation of whatever is coming hot off the lines of TSMC.

Thing is, performance isn't just about blazing speeds. It's not just about ease of use. It's not just about the longest battery lifetime. It's about all of these things, and in my opinion, as a blanket term it should aim to cover those, at once. This seems like a sensible take on things?

If your device has blazing fast clock speeds — arguably, a measure of performance itself — then, why should it waste a lot of time in run mode, burning through batteries, just to run bloated code?

Equally, if your device aims to live forever, surely there are ways to eke out some extra "performance," rather than boosting the CPU clock to crank through crate loads of instructions, just because the code isn't optimized?

Larry Bank — AKA The Performance Whisperer, AKA @fast_code_r_us — feels similarly, and has a mission to pack performance into common libraries that find themselves used regularly on the scene.

With a current focus on display drivers and associated code bases, the results he is obtaining are incredible. We'll talk to Bank in an upcoming article, where we get to know him a bit better, and take a look under the hood of some of his recent library releases, including the brilliant Animated GIF and JPEGDEC libraries, and more.

Boosting performance without breaking the 'Bank!'

I first became aware of the efforts of Bank through some collaborative work with another Hackster favorite — Mike Rankin. Rankin loves the ESP32, and the power that comes with it. However, he also loves getting that beast of a chip running off batteries. If you'd told me that I'd see an ESP32 running from a coin cell, I'd have scoffed, but Rankin has done just that on numerous occasions!

A number of his small, coin-cell-powered ESP32 boards are fitted with displays. In checking out a demo of one of these, I noticed an incredible text display demo that seamlessly morphed into a particle system, with the pixels on the screen suddenly and fluidly succumbing to the forces of gravity, as dictated by the onboard accelerometer.

Rankin sure designs some decent hardware, but this beautiful display of pixel pushing comes courtesy of @fast_code_r_us.

After seeing this, I got to speaking to Bank, and we mused on the commonplace occurrence of simply slapping a hellishly powerful microcontroller on the most basic of tasks, with little thought or care given to the idea of optimizing the code to the task at hand.

Pocket powerhouse processors

Nearly every (hardware) project we see these days is based in some part or another on a 32-bit Cortex MCU. While the Cortex-M0+ powered SAM D21 holds its own in terms of market share (driving the Arduino Zero / MKR boards, along with a number of others from the likes of Adafruit, et al.), there has been a recent surge in offerings powered by its bigger brother, the Cortex-M4 based SAM D51: a similar core to the M0+, but with many, many more hours spent in the gym.

These are still microcontrollers however, and despite the clock speeds on some of them, an MCU does not an MPU make.

Until recently, designers of rich, fluid graphical user interfaces have typically looked beyond the MCU world of product offerings, in search of higher clock rates, extended memory interfaces, and fast, parallel buses over which to sling bits and bytes at a display.

This eventually leads to MPU-based designs, which are (usually) vastly more complex than the breadboard-based MCU bit-bashers we see on our Twitter feeds. For a good rundown of the differences between MCU and MPU parts, check out Microchip's application note on the subject.

In a nutshell, the expectation is that an MCU simply doesn't cut the mustard in comparison, due to things like slower clock speeds, limited peripheral / memory headroom, and simpler bus interfaces, with performance figures measured in DMIPS that fall far below those scored by MPU cores.

The gap is closing... kind of.

While things like car infotainment clusters are generally the design-in realm of MPUs, we hobbyists are finding that we can achieve similarly impressive hardware design efforts with the "lowly" MCU-class parts that we are able to buy.

With clock speeds on even a "lowly" M0 approaching 100 MHz in some cases, these low-cost, pint-sized parts are capable of throwing some serious shapes on the many display devices that are available for a dime a dozen.

Thing is, in terms of clock cycles, they can also do a lot more with a lot less than the "stock" demo code that comes with a library will show.

There's a reason that things like Arduino libraries can often be... well, bloated, for lack of a better word.

With many of the libraries having roots that go as far back as the Arduino ecosystem itself, much of their codebase was originally written with no knowledge of the hardware and MCU cores it is now asked to run on, because those parts simply didn't exist at the time.

Skip forward a little bit in time, and suddenly a torrent of issues on GitHub sets alarm bells ringing, raised by people who say that this legacy code fails on a certain, newer chipset that has been brought into the Arduino ecosystem.

Despite the layers and levels of Arduino code that mean all the hardware variants should be exposed via the usual "system" level calls, an edge case has arisen, and something isn't working.

Some time later, a new release is pushed, featuring a "just make it work already" patch, and all is well. The project compiles, and does what it's meant to — from a functional perspective anyway.

Panning (or, paining?) for performance

You can already see some areas where performance has taken a back seat, however. The "one glove fits all" approach taken by the goldrush that is the Arduino ecosystem can sometimes be found wanting when it comes to code efficiency. And to be fair, without doubling, or potentially tripling, the size of the code base from which the end program is compiled, there isn't really another way.

It's an incredible feat to have a single .ino file that can be compiled to run on five-plus types of MCU cores, with different resources, peripherals, memory, and even vendor core IP, without a single adjustment to the code.

But in order to play nice across all that silicon, you can't really drill down and dive into all the areas that can be optimized as fully as you might want to.

Sipping power in between Serial.print...

Eking out every ounce of optimization, in the quest to have the CPU clocked into sleep mode for as long as possible, requires an intimate knowledge of the device being targeted. It's one thing to have "hello world" dump that proof-of-concept string out of a USART across a range of MCU cores.

It's another thing entirely to have that program then wind down the core, task accomplished, dispose of the instantiated serial object, and do things like clocking down the (by default) powered peripherals, and finally, go to sleep, until the next boot.

Power consumption during the task itself might not differ much between the two sets of code, but once the string has been printed, their current draw will differ by orders of magnitude!
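To put rough numbers on that claim, here is a minimal sketch of the duty-cycle arithmetic. Every figure in it (20 mA while working, 5 mA idling awake, 5 µA asleep, a 10 ms task once per second, a 220 mAh cell) is an illustrative assumption rather than a measurement of any particular board, and the function names are made up for this example.

```cpp
// Average current over one work/rest cycle, in milliamps.
// active_ma: draw while doing the work; rest_ma: draw for the rest of the period.
constexpr double average_current_ma(double active_ma, double rest_ma,
                                    double active_ms, double period_ms) {
    return (active_ma * active_ms + rest_ma * (period_ms - active_ms)) / period_ms;
}

// Ideal runtime in hours, ignoring self-discharge, regulator losses and derating.
constexpr double runtime_hours(double capacity_mah, double average_ma) {
    return capacity_mah / average_ma;
}

// Illustrative assumptions only, not datasheet values.
constexpr double kActiveMa    = 20.0;    // while printing
constexpr double kIdleAwakeMa = 5.0;     // core left running after the print
constexpr double kSleepMa     = 0.005;   // core and peripherals put to sleep
constexpr double kActiveMs    = 10.0;
constexpr double kPeriodMs    = 1000.0;
constexpr double kCellMah     = 220.0;

constexpr double kAvgAwakeMa =
    average_current_ma(kActiveMa, kIdleAwakeMa, kActiveMs, kPeriodMs);
constexpr double kAvgSleepMa =
    average_current_ma(kActiveMa, kSleepMa, kActiveMs, kPeriodMs);
```

With these made-up numbers, the average draw falls from roughly 5.15 mA to about 0.2 mA, which stretches the ideal runtime of the cell from under two days to around six weeks.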

Granted, for a serial "hello world," this is probably outside of the design spec. But consider how many of us are writing code, with a good understanding of where the program exits, of where it's taking low level control of the power state of the MCU, and other such things to be pondered when proceeding on a quest for battery life.

Digging into display drivers...

Writing a display driver that will function optimally, without a complete concept and list of the full range of hardware it will run upon, or the characteristics of the data being pushed, is a tall order indeed.

There are different ways of slinging serial data at a display controller, based on the size of the screen, the size of the active area being updated, and even down to the various peripherals of the MCU that is tasked with talking to said display.

You can sit and iterate through a loop, writing line by line, or worse, pixel by pixel, for an entire display frame worth of data in some cases. The worst libraries will wrap individual pixel updates with myriad redundant calls to the display controller, repeatedly telling it to do things it has already done, just because it's easier to code the library that way. It's fair enough, it works, but it's about as far from optimized as you can get.
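As a back-of-the-envelope illustration of what those redundant calls cost, the sketch below counts the bytes sent over the bus for a full frame. It assumes a hypothetical controller where setting the address window takes 11 bytes and each pixel is 16 bits; the sizes and function names are assumptions for this example, not the behaviour of any specific library or display.

```cpp
#include <cstdint>

// Hypothetical controller: setting the address window costs a column command
// (1 command byte + 4 data bytes), a row command (1 + 4) and a "start write"
// command (1). These sizes are assumptions for illustration.
constexpr std::uint32_t kWindowSetupBytes = 5 + 5 + 1;
constexpr std::uint32_t kBytesPerPixel = 2;  // assuming 16-bit colour

// Naive path: re-send the whole address window before every single pixel.
constexpr std::uint32_t bytes_pixel_by_pixel(std::uint32_t width,
                                             std::uint32_t height) {
    return width * height * (kWindowSetupBytes + kBytesPerPixel);
}

// Batched path: set the window once, then stream all the pixel data.
constexpr std::uint32_t bytes_streamed(std::uint32_t width,
                                       std::uint32_t height) {
    return kWindowSetupBytes + width * height * kBytesPerPixel;
}
```

For a 240 x 240 frame under these assumptions, the pixel-by-pixel path moves 748,800 bytes against 115,211 for the streamed path: about six and a half times the traffic for the same picture.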

Conversely, there are libraries that insist on full-page updates of the display. That might be needed for some screens, but these libraries address each and every display controller the same way, even when the controller is actually a model that supports "windowed" updates, where only the area of the screen that has changed is updated "on the wire."

These extra calls mean that the CPU spends far more time awake than it needs to, which in turn means your battery spends far more time being drained than it needs to.
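One common way to take advantage of windowed updates is to track a "dirty rectangle": compare the new frame with the previous one and send only the smallest box that contains every changed pixel. The sketch below is a minimal, hypothetical version for a tiny 8 x 4 frame buffer with one byte per pixel; the names and sizes are assumptions for illustration and are not taken from any particular library.

```cpp
#include <cstdint>

constexpr int kWidth = 8;   // tiny hypothetical display, for illustration
constexpr int kHeight = 4;

struct Rect {
    int x0, y0, x1, y1;  // inclusive corners
    bool empty;          // true when nothing changed
};

// Smallest rectangle covering every pixel that differs between two frames.
constexpr Rect dirty_rect(const std::uint8_t* prev, const std::uint8_t* next) {
    Rect r{kWidth, kHeight, -1, -1, true};
    for (int y = 0; y < kHeight; ++y) {
        for (int x = 0; x < kWidth; ++x) {
            if (prev[y * kWidth + x] != next[y * kWidth + x]) {
                if (x < r.x0) r.x0 = x;
                if (y < r.y0) r.y0 = y;
                if (x > r.x1) r.x1 = x;
                if (y > r.y1) r.y1 = y;
                r.empty = false;
            }
        }
    }
    return r;
}

// Pixels that must be sent when only the dirty window is updated.
constexpr int pixels_to_send(const Rect& r) {
    return r.empty ? 0 : (r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1);
}

// Demo: change two pixels in an otherwise blank frame.
constexpr Rect demo_rect() {
    std::uint8_t prev[kWidth * kHeight] = {};
    std::uint8_t next[kWidth * kHeight] = {};
    next[1 * kWidth + 2] = 1;  // pixel at x=2, y=1
    next[2 * kWidth + 5] = 1;  // pixel at x=5, y=2
    return dirty_rect(prev, next);
}
```

In the demo, the two changed pixels give a window from (2, 1) to (5, 2), so 8 pixels go over the wire instead of the full 32. On a real display the saving depends entirely on how much of the screen changes between frames.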

Peripherals can help!

There are now things to consider like "autonomous" MCU peripherals, with DMA controllers often able to take data from memory and sequentially pipe it out to a serial output, such as an SPI peripheral, all without ever needing to bring the CPU out of sleep mode. This has massive advantages for power savings, but again, is near enough impossible to code for from a "one library fits all" device perspective.

I could go on, and on, and on...

Some will say that this diatribe, for lack of a better word, against the "one library for all" approach might be unfounded...

Obviously, the Arduino code base would suggest that it's certainly a "good enough" approach — and well, yes — it's "good enough." But does it allow us to do things as well as they could be done?

What's the point in spending extra BoM cost on a larger battery, when a few well-placed lines of application-specific code could have saved the need for an extra few hundred mAh?

I've been talking to Bank about some of these points, and in an upcoming article, we're going to pick his brains a bit, and find out what has set light to his passionate pursuit of power savings and performance gains.

Go and have a play, why not?

We'll also cover his incredible contributions to the community code base so far, with recent releases not only of his incredibly well received Animated GIF library, but the more recent JPEGDEC library, providing optimized JPG file handling and decoding on embedded platforms.

If you have ever been left bumping your head against the desk in trying to figure out the correct encoding for bitmaps, or wishing you could work with a slightly less memory intensive image format, his work will be a welcome addition to your GitHub Starred repos!


Tom Fleet

Hi, I'm Tom! I create content for Hackster News, allowing us to showcase your latest and greatest projects for the world to see!