Get an Edge in Edge AI

Hardware-Aware Transformers can smooth out your rough AI edges.

Nick Bild
6 years ago · AI & Machine Learning

Researchers at the Massachusetts Institute of Technology have developed a sophisticated "throw everything at the wall and see what sticks" approach to reducing inference latency on edge AI devices. They call their approach Hardware-Aware Transformers (HAT).

A Transformer is a type of machine learning model originally developed by researchers at Google. While Transformers are arguably best-in-class performers on data that comes in an ordered sequence (e.g., natural language processing), the computational resources required by the model can easily bring edge devices to their knees.

For example, translating a 30-word sentence (using a Transformer-Big model) requires 20 seconds of processing time on a Raspberry Pi's Arm Cortex-A72 CPU. If you had to wait more than a few seconds for your smart speaker to turn the lights off, you would probably just do it the old-fashioned way. That is where HATs come in.

To create a HAT, the first step is to construct a large Transformer model with heterogeneous layers capable of adapting to varied hardware constraints. This large model, a so-called "SuperTransformer," contains many weight-sharing Sub-Transformers. To keep training costs down, only the SuperTransformer itself is trained: at each training step a Sub-Transformer is sampled and the shared weights it activates are updated, so every Sub-Transformer gets a training signal without ever being trained from scratch. Latency metrics for sampled Sub-Transformers are also collected, which later lets the search estimate how fast a given configuration will run on the target hardware.
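The sampling step above can be sketched in a few lines. The dimension names and value ranges below are illustrative stand-ins, not the paper's exact design space; the idea is just that each sampled configuration selects a slice of the SuperTransformer's shared weights.

```python
import random

# Hypothetical design space for Sub-Transformer sampling. The dimensions
# and ranges here are invented for illustration; HAT's actual search
# space covers per-layer choices such as embedding dim, hidden dim,
# head number, and layer count.
DESIGN_SPACE = {
    "num_layers": [2, 4, 6],
    "embed_dim": [256, 384, 512],
    "num_heads": [2, 4, 8],
    "ffn_dim": [512, 1024, 2048],
}

def sample_sub_transformer(rng=random):
    """Sample one Sub-Transformer configuration from the design space.

    During SuperTransformer training, each step would activate only the
    slice of shared weights that this configuration selects and update
    those weights alone.
    """
    return {dim: rng.choice(options) for dim, options in DESIGN_SPACE.items()}

config = sample_sub_transformer()
print(config)
```

Because every Sub-Transformer reads its weights out of the same shared tensors, training one sampled configuration per step amortizes the cost of training the whole population.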

With a trained SuperTransformer in hand, the final step is an evolutionary search to find the Sub-Transformer that performs best while staying within a given set of hardware latency constraints.
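A toy version of that constrained evolutionary search might look like the following. The `score` and `latency` functions are invented placeholders (standing in for validation accuracy and a hardware latency estimate), and the two-dimensional space is deliberately tiny; this is a sketch of the search loop, not HAT's implementation.

```python
import random

# Minimal design space for the toy search (illustrative only).
SPACE = {
    "num_layers": [2, 4, 6],
    "embed_dim": [256, 384, 512],
}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def latency(cfg):
    # Placeholder: pretend latency grows with model size.
    return cfg["num_layers"] * cfg["embed_dim"] / 100.0

def score(cfg):
    # Placeholder: pretend quality also grows with model size,
    # so the search must trade quality against the latency budget.
    return cfg["num_layers"] * cfg["embed_dim"]

def mutate(cfg):
    # Re-sample one randomly chosen dimension.
    child = dict(cfg)
    k = random.choice(list(SPACE))
    child[k] = random.choice(SPACE[k])
    return child

def evolve(latency_budget, generations=20, pop_size=16):
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep only configurations that meet the hardware constraint,
        # then breed the next generation from the highest scorers.
        feasible = sorted(
            (c for c in pop if latency(c) <= latency_budget),
            key=score, reverse=True,
        )
        parents = feasible[: max(2, pop_size // 4)] or [sample()]
        pop = parents + [
            mutate(random.choice(parents))
            for _ in range(pop_size - len(parents))
        ]
    feasible = [c for c in pop if latency(c) <= latency_budget]
    return max(feasible, key=score) if feasible else None

best = evolve(latency_budget=25.0)
print(best)
```

The key property is that the fitness function never needs to retrain anything: each candidate inherits its weights from the SuperTransformer, so evaluating it is cheap enough to run an evolutionary loop.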

The authors tested HATs against several large datasets (WMT'14 En-De, WMT'14 En-Fr, WMT'19 En-De, and IWSLT'14 De-En) on a Raspberry Pi Arm CPU, an Intel Xeon CPU, and an NVIDIA TITAN Xp GPU. They concluded that, compared with the Transformer-Big model, HATs can achieve up to a threefold inference speedup without loss of accuracy. This is especially impressive considering that the resulting models are also up to 3.7x smaller.

For machine learning problems suited to the HAT architecture, it may offer substantial reductions in inference latency as compared to other existing methods.
