Tensor Manipulation Units Promise to Slash Latency for Machine Learning and Artificial Intelligence

Tiny accelerator delivers outsized gains, reducing end-to-end inference latency in a custom AI chip by more than a third.

Researchers from the University of Macau, the Chinese Academy of Sciences, University College Dublin, and Nanyang Technological University have turned their focus away from building better tensor processing units (TPUs), looking to meet the demand for faster machine learning and artificial intelligence (ML and AI) acceleration at a different point in the pipeline: with a tensor manipulation unit instead.

"While recent advances in AI SoC [System-on-Chip] design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high,volume data movement with minimal computation, remains under-explored," the researchers explain. "This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators."

The explosion of interest in generative artificial intelligence has driven a dramatic increase in demand for computational power, both for training models and for running them. Rather than throwing ever more tensor-computation capability at the problem, though, the researchers are looking to accelerate the manipulation of tensors, the multidimensional data structures underpinning artificial intelligence algorithms.

To prove the concept, the team used SMIC's established 40nm process node to fabricate a "tensor manipulation unit," or TMU, designed for integration into a system-on-chip targeting machine learning and artificial intelligence workloads. Despite taking up just 0.019mm² of die space, the ten-operation TMU, which uses an execution model inspired by reduced instruction set computing (RISC), delivered operator-level latency a claimed 1,413.43 times lower than Arm's general-purpose Cortex-A72 processor and 8.54 times lower than NVIDIA's AI-focused Jetson TX2 system-on-chip. Integrated alongside an in-house TPU, the combination is reported to cut end-to-end inference latency by 34.6 percent.
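To see how operator-level gains can translate into that end-to-end figure, a back-of-the-envelope Amdahl's-law estimate is useful; the fraction of inference time spent on data movement and the average TMU speedup used below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope Amdahl's-law estimate.
# Both numbers below are illustrative assumptions, not measurements from the paper.
manipulation_share = 0.35   # assumed fraction of inference time spent on tensor manipulation
tmu_speedup = 20.0          # assumed average speedup for those operators on the TMU

new_time = (1 - manipulation_share) + manipulation_share / tmu_speedup
reduction = 1 - new_time
print(f"Estimated end-to-end latency reduction: {reduction:.1%}")
# Prints roughly 33 percent, in the same ballpark as the 34.6 percent the team reports.
```

The point of the exercise is that even a modest share of runtime spent on data movement caps the benefit of ever-faster tensor computation, so offloading those operators to a dedicated block pays off at the whole-model level.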

A preprint of the team's work is available on Cornell's arXiv server under open-access terms.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.