Semi-Decoupled Co-Design Offers a Big Speed Boost for Finding Optimal Neural Architectures

Designed to reduce the search space "by orders of magnitude," this new approach makes finding an optimal design considerably faster.

A quartet of researchers from the University of California, Riverside (UC Riverside) and the University of Notre Dame have offered a look at what they claim is a "fast and optimal" method of co-design for neural network accelerators, based on a semi-decoupled approach.

"Hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network performance," the team explains of previous work in the field. "Nonetheless, such co-design also enlarges the total search space to practically infinity and presents substantial challenges."

The proposed solution: rather than fully decoupling the neural network and accelerator design, the researchers semi-decouple it through the use of a "proxy accelerator," reducing the design space "by orders of magnitude" while still producing a close-to-optimal design in considerably less time than prior approaches.

"We first perform neural architecture search to obtain a small set of optimal architectures for one accelerator candidate," the team writes. "Importantly, this is also the set of (close-to-)optimal architectures for other accelerator designs based on the property that neural architectures' ranking orders in terms of inference latency and energy consumption on different accelerator designs are highly similar. Then, instead of considering all the possible architectures, we optimize the accelerator design only in combination with this small set of architectures, thus significantly reducing the total search cost."

To prove the concept, the team set up the open source MAESTRO deep neural network (DNN) accelerator simulator as a benchmark, running 5,000 different hardware-dataflow combinations using NAS-Bench-301 and AlphaNet. For the former, the team's approach found an optimal architecture at a search cost of 3.7k, compared with the 135k required by a more traditional approach, dramatically reducing the time taken while still providing an optimal solution with predictable performance and power usage.

"Concretely," the researchers conclude, "we demonstrate latency and energy monotonicity among different accelerators, and use just one proxy accelerator's optimal architecture set to avoid searching over the entire architecture space. Compared to the SOTA [State-of-the-Art] co-designs, our approach can reduce the total design complexity by orders of magnitude, without losing optimality."

The team's work is available under open-access terms on Cornell's arXiv preprint server following its presentation at the tinyML Research Symposium 2022; the source code, based on MAESTRO, is available on GitHub under the permissive MIT license.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.