Arm Announces Updated AI and Deep Learning Framework for IoT Hardware

Disclaimer: Opinions expressed here are my own and not my employer’s. I work for Arm!

Rex St. John
3 years ago · Machine Learning & AI

This week, Arm announced an update to the Compute Library that includes a variety of new functions to help AI developers make better use of Arm compute hardware. First announced earlier in the year, the Compute Library gives embedded developers a toolkit for building AI-enabled, low-cost IoT devices. I wanted to take a minute to talk about why this is an interesting development for hardware developers and makers.

Compute Library Enables AI for All

AI, perceptual computing, and deep learning on hardware devices have become increasingly hot topics as tech companies seeking a competitive edge in next-generation IoT use cases have poured billions into new and more exotic forms of silicon.

While people may be familiar with ASICs, GPUs and FPGAs for deep learning and AI-related tasks, what many don’t realize is that a sizable amount of AI-enabling hardware on the market comes in the form of GPUs packaged on mobile SoCs. Arm Mali GPUs, for example, shipped in nearly 1 billion devices last year (2016), mostly mobile phones.

The picture becomes more interesting once you realize that these same mobile SoCs (initially produced for handset makers) will ultimately make their way down the value chain and flow out into IoT, creating a tremendous “long tail” of low-cost devices capable of AI calculations.

This is where the Arm Compute Library comes into the picture. The Compute Library optimizes and accelerates a variety of common, useful algorithms and functions frequently used in computer vision for Arm architectures, helping to enable this long tail of gadgets.

Here are a few highlights of what is contained in the latest update:

OpenCL C (targeting Mali GPUs):

  • Bounded ReLU
  • Depthwise convolution (used in MobileNet)
  • De-quantization
  • Direct convolution 1x1
  • Direct convolution 3x3
  • Direct convolution 5x5
  • Flattening for 3D tensor
  • Floor
  • Global pooling (used in SqueezeNet)
  • Leaky ReLU
  • Quantization
  • Reduction operations
  • ROI pooling

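To give a concrete sense of what the activation functions above actually compute, here is a plain-Python sketch of the element-wise math behind bounded ReLU and leaky ReLU. This is only an illustration of the formulas; the library itself implements these as optimized OpenCL C kernels for Mali GPUs, and the `upper` and `alpha` defaults below are my own assumed example values, not the library's API.

```python
def bounded_relu(x, upper=6.0):
    """Bounded ReLU: clamp x to the range [0, upper].
    The upper bound is a layer parameter (6.0 gives the common "ReLU6")."""
    return min(max(0.0, x), upper)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: pass positive values through unchanged, but scale
    negative values by a small slope alpha instead of zeroing them."""
    return x if x >= 0.0 else alpha * x

# Applied element-wise over a tensor:
values = [-2.0, -0.5, 0.0, 3.0, 10.0]
print([bounded_relu(v) for v in values])   # clamps 10.0 down to 6.0
print([leaky_relu(v) for v in values])     # keeps a small negative slope
```

The bounded variant matters on quantized hardware because capping the output range makes fixed-point representation easier; the leaky variant avoids "dead" neurons by keeping a nonzero gradient for negative inputs.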
CPU (NEON):

  • Bounded ReLU
  • Direct convolution 5x5
  • De-quantization
  • Floor
  • Leaky ReLU
  • Quantization
  • New functions with fixed point acceleration
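Quantization and de-quantization, which appear in both lists, map floating-point tensors to 8-bit integers and back so that networks can run in cheap fixed-point arithmetic. As a rough sketch, here is the affine quantization scheme commonly used for this (the `scale` and `zero_point` parameters and value ranges are assumptions for illustration, not the library's exact interface):

```python
def quantize(x, scale, zero_point):
    """Map a float to an unsigned 8-bit value:
    q = round(x / scale) + zero_point, clamped to [0, 255]."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))

def dequantize(q, scale, zero_point):
    """Approximate inverse: x ~= (q - zero_point) * scale."""
    return (q - zero_point) * scale

# Example: scale of 0.05 with zero_point 128 covers roughly [-6.4, 6.35].
scale, zero_point = 0.05, 128
q = quantize(1.0, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

The round trip is lossy (values snap to the nearest multiple of `scale`), which is exactly why the CPU list above also includes new functions with fixed-point acceleration: once tensors are in this form, the heavy math can stay in integer arithmetic.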

Also announced was a series of micro-architecture optimizations:

When we started the Compute Library project, our primary purpose was to share a comprehensive set of low level functions for computer vision and machine learning that provided good performance — but most importantly that was reliable and portable. The library is there to reduce cost and time efforts by developers and partners targeting Arm processors, whilst at the same time, also to behave well across the many system configurations that our partners implement. This is why we chose to use NEON intrinsics and OpenCL C as the target languages. However, there are cases where it is critical to extract every ounce of performance from the hardware. We therefore looked at adding to the library low-level primitives optimised using hand-coded assembly tailored to the micro-architecture of the target CPU.

If any of these areas interest you, take a look at the blog post and learn more.
