The future of machine learning (ML) is looking increasingly tiny. Tiny in terms of the hardware that the algorithms run on, that is. This is in large part because running ML models on embedded edge devices offers energy efficiency along with improvements in privacy, latency, and security. But this move from the cloud to low-power, resource-constrained devices is easier said than done. There is the not-so-tiny problem of adapting the algorithms to fit the hardware, adapting the hardware to fit the algorithms, or both.
A team of engineers at Harvard, Purdue, and Google has developed an open source framework called CFU Playground, designed to help create customized hardware that can dramatically speed up the execution of ML algorithms. By making use of reconfigurable hardware in a full-stack solution, users can get rapid feedback on their designs and quickly iterate through ML accelerator revisions.
The new framework was built because tinyML is a fast-moving area of research, and the high costs and long development times associated with creating purpose-built hardware prohibit the rapid prototyping needed to make new advancements. At the core of CFU Playground is a set of tools for building accelerators in an FPGA. With a soft RISC-V processor implemented in the fabric of the FPGA, and support for TensorFlow Lite, a user can quickly get an ML model up and running. From there, the user can identify portions of the algorithm that could be sped up, and design custom function units (CFUs) in the FPGA's reconfigurable logic to handle those parts of the processing much faster than the general-purpose main processor could.
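To make the idea concrete, here is a minimal sketch in plain C of the kind of operation a CFU might implement. In the real framework, a call like this maps to a single RISC-V custom instruction executed by hardware in the FPGA; the function below is only a software model of that operation (the name `cfu_simd_mac` is our own, not from the CFU Playground codebase). The operation shown, a multiply-accumulate over four packed 8-bit values, is typical of the inner loops of quantized neural network inference.

```c
#include <stdint.h>

// Software model of a hypothetical CFU operation: multiply-accumulate
// four signed 8-bit lanes packed into each 32-bit operand. In hardware,
// a CFU would compute all four products in parallel in one instruction;
// here we loop over the lanes so the behavior can be checked on a host.
static inline int32_t cfu_simd_mac(int32_t acc, uint32_t a, uint32_t b) {
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i)); // extract lane i of a
        int8_t bi = (int8_t)(b >> (8 * i)); // extract lane i of b
        acc += (int32_t)ai * (int32_t)bi;   // accumulate the product
    }
    return acc;
}
```

A model like this doubles as a "golden reference" during development: the hardware CFU can be tested against it before the design is committed to the FPGA.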
Deploying a new design to the FPGA takes only a few minutes, so it is quick to experiment with and test new ideas. The framework also allows a user to iteratively add CFUs to the design as opportunities to improve algorithm performance are identified. By growing the CFUs, one by one, it is possible to construct something very near to a full-blown ML accelerator.
The engineers put CFU Playground through its paces in accelerating the convolution operation of the MobileNetV2 neural network. They implemented a number of optimizations, including loop unrolling, SIMD multiply-accumulate, and pipelining. In only a few weeks' time, a single engineer was able to achieve a massive 55-fold speed increase in this operation. In a similar exercise, the researchers applied their methods to a keyword spotting application, and found that they were able to achieve a staggering 75-fold speed increase with a strategic use of parallelism and pipelining.
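Loop unrolling, one of the optimizations mentioned above, can be illustrated with a small, self-contained C sketch (the function names and the unroll factor are our own choices for illustration, not taken from the MobileNetV2 work described here). Unrolling exposes independent operations that a pipelined processor, or a CFU, can overlap:

```c
#include <stddef.h>
#include <stdint.h>

// Naive quantized dot product: one multiply-accumulate per iteration,
// with a loop-carried dependency through the single accumulator.
int32_t dot_naive(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

// Unrolled by 4 with separate accumulators: the four multiply-accumulates
// in each iteration are independent, so they can proceed in parallel on
// pipelined or SIMD hardware. A scalar tail loop handles leftover elements.
int32_t dot_unrolled(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * (int32_t)b[i];
        acc1 += (int32_t)a[i + 1] * (int32_t)b[i + 1];
        acc2 += (int32_t)a[i + 2] * (int32_t)b[i + 2];
        acc3 += (int32_t)a[i + 3] * (int32_t)b[i + 3];
    }
    for (; i < n; i++)
        acc0 += (int32_t)a[i] * (int32_t)b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

In a CFU-based design, the body of the unrolled loop is exactly the kind of hot spot that would be replaced with a custom instruction, compounding the software and hardware gains.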
Whether the final target device is to be an FPGA or an ASIC, CFU Playground has some great tools available to speed up the development process. And thanks to its open source design, the framework is compatible with many FPGA development platforms. The source code is freely available online for anyone interested in giving it a spin.