More Bang for Your Bits
Microsoft's breakthroughs in AI efficiency — Ladder, T-MAC, and the LUT Tensor Core — enable optimized AI to run faster on everyday devices.
The latest and greatest applications in artificial intelligence (AI), especially generative AI tools, most commonly run on very powerful computing clusters located in remote data centers. That is no more the end goal than room-sized machines grinding through relatively simple calculations were the end goal just over half a century ago; it is simply a reflection of where the technology stands today. Ideally, these cutting-edge algorithms would run on small, low-power systems, right where they are needed. That would make real-time applications built on these tools practical, and it would also be a boon to data privacy.
Engineers are presently working around the clock to make this goal a reality. One approach that has gained favor in recent years involves a process called quantization. This process reduces the memory and computational requirements of AI models by representing their parameters with fewer bits. Large language models, which can have billions of parameters, traditionally rely on 32-bit or 16-bit floating-point precision for computation. However, running these models on resource-constrained edge devices like smartphones, laptops, and robots requires compressing them to lower-bit representations (such as 8-bit, 4-bit, or even 2-bit formats).
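To make the idea concrete, here is a minimal sketch of symmetric 4-bit weight quantization in NumPy. The group size and the quantize_int4/dequantize_int4 helper names are illustrative choices for this sketch, not part of any of the specific methods discussed below.

```python
# Minimal sketch of symmetric 4-bit weight quantization (illustrative only).
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a flat float32 weight array to signed 4-bit integer levels."""
    w = weights.reshape(-1, group_size)
    # One scale per group, chosen so the largest weight maps into the int4 range [-8, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 tensor from the quantized values."""
    return (q.astype(np.float32) * scales).reshape(-1)

# 64 random weights: each drops from 32 bits to 4 bits plus a shared per-group scale.
w = np.random.randn(64).astype(np.float32)
q, s = quantize_int4(w)
print("max reconstruction error:", np.abs(w - dequantize_int4(q, s)).max())
```

In this sketch the 4-bit values are held in int8 for simplicity; packed into true 4-bit storage, a group of 32 weights shrinks from 128 bytes of float32 to 16 bytes plus one scale, at the cost of a small, bounded rounding error.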
Despite its promise, low-bit quantization presents some significant challenges. One major issue is that hardware typically supports only symmetric computations, in which both operands use the same data format. However, modern quantization techniques rely on mixed-precision computations, where different parts of a model use different bit widths to balance accuracy and efficiency; weights, for example, may be stored as 4-bit integers while activations remain in 16-bit floating point. Standard hardware struggles to support such asymmetrical operations, limiting the benefits of low-bit quantization.
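As a rough illustration (not any particular vendor's kernel), the sketch below shows the workaround that symmetric hardware forces today: the low-bit weights have to be widened back to fp16 before a standard matrix multiply can run.

```python
# Sketch of the format mismatch; the scale factor and shapes are arbitrary.
import numpy as np

activations = np.random.randn(1, 8).astype(np.float16)            # activations stay in fp16
q_weights = np.random.randint(-8, 8, size=(8, 1), dtype=np.int8)   # 4-bit weight values
scale = np.float16(0.05)                                            # illustrative scale factor

# What a low-bit model wants is a direct fp16 x int4 multiply, but standard GEMM
# kernels only accept matching operand types, so the weights are widened first:
w_fp16 = q_weights.astype(np.float16) * scale
output = activations @ w_fp16                                       # fp16 x fp16 multiply
print(output.shape)  # (1, 1)
```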
To overcome these obstacles, researchers at Microsoft have developed a three-part solution to improve support for mixed-precision general matrix multiplication (mpGEMM): the Ladder data type compiler, the T-MAC mpGEMM library, and the LUT Tensor Core hardware architecture. These innovations are designed to optimize computations, reduce overhead, and enable efficient AI inference on edge devices.
The Ladder data type compiler acts as a bridge between unsupported low-bit data types and existing hardware capabilities. It translates emerging data formats into hardware-supported ones without loss of information. By doing so, Ladder enables AI models to run efficiently on existing chips, even if those chips were not explicitly designed for the latest quantization techniques. Microsoft’s evaluations show that Ladder outperforms existing compilers and achieves speedups of up to 14.6 times over previous methods.
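The sketch below gives a flavor of this kind of lossless type translation, though it is only an analogy and not Ladder's actual mechanism: two signed 4-bit values are packed into a single uint8, a type that existing memory systems and instructions already handle, and recovered exactly on demand.

```python
# Illustrative lossless translation of 4-bit values into a hardware-native type (uint8).
import numpy as np

def pack_int4_pairs(q: np.ndarray) -> np.ndarray:
    """Pack an even-length array of signed 4-bit values (held in int8) into uint8 bytes."""
    q = q.astype(np.uint8).reshape(-1, 2)      # reinterpret the low 8 bits of each value
    lo = q[:, 0] & 0x0F
    hi = (q[:, 1] & 0x0F) << 4
    return (lo | hi).astype(np.uint8)

def unpack_int4_pairs(packed: np.ndarray) -> np.ndarray:
    """Recover the original signed 4-bit values exactly; nothing is lost in translation."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend the 4-bit values back to int8.
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    return np.stack([lo, hi], axis=1).reshape(-1)

q = np.array([-8, 7, -1, 3], dtype=np.int8)
assert np.array_equal(unpack_int4_pairs(pack_int4_pairs(q)), q)
```

Because the round trip is exact, the model sees the same numbers it would see with native 4-bit support; only the storage and movement happen in types the hardware already understands.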
Another major bottleneck in deploying quantized AI models is the computational cost of matrix multiplication. Traditionally, low-bit models require dequantization, converting compressed values back into higher precision before multiplication, which negates much of the efficiency gain. The T-MAC mpGEMM library eliminates this problem by replacing multiplication with a lookup table (LUT) approach. Instead of performing costly multiply-accumulate operations, T-MAC precomputes partial results and stores them in small tables, so the system can look values up rather than recompute them, dramatically reducing computational overhead.
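The sketch below illustrates the general lookup-table idea, not the T-MAC library itself, using single-bit weights and a group size of four for simplicity; real implementations handle multi-bit weights as several such bit planes and lay the tables out for fast vectorized access, but the principle is the same.

```python
# Sketch of LUT-based dot products: tables of precomputed partial sums replace multiplies.
import numpy as np

G = 4  # activations covered by each table lookup (illustrative choice)

def build_tables(activations: np.ndarray) -> np.ndarray:
    """For each group of G activations, precompute the sum for every 2**G weight-bit pattern."""
    groups = activations.reshape(-1, G)                        # (n_groups, G)
    patterns = np.array([[(p >> i) & 1 for i in range(G)]      # (2**G, G) matrix of 0/1 bits
                         for p in range(2 ** G)], dtype=np.float32)
    return groups @ patterns.T                                  # (n_groups, 2**G) lookup tables

def lut_dot(tables: np.ndarray, weight_bits: np.ndarray) -> float:
    """Dot product of the activations with one-bit weights, using only lookups and adds."""
    w = weight_bits.reshape(-1, G)
    idx = (w * (1 << np.arange(G))).sum(axis=1)                 # encode each group as a table index
    return float(tables[np.arange(len(idx)), idx].sum())

x = np.random.randn(16).astype(np.float32)
w = np.random.randint(0, 2, size=16)                            # one-bit weights
tables = build_tables(x)                                        # built once per activation vector
assert np.isclose(lut_dot(tables, w), float(x @ w))
```

The payoff is that the 16-entry tables are computed once per activation group and then reused for every row of the weight matrix, so the per-weight work shrinks to an index lookup and an addition.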
While Ladder and T-MAC optimize AI computations on existing CPUs and GPUs, even greater efficiency gains require dedicated hardware. This is where LUT Tensor Core comes in — a new architecture designed specifically for low-bit quantization and mixed-precision calculations. LUT Tensor Core introduces a software-hardware co-design approach that tackles key challenges in LUT-based inference, including efficient table storage and reuse to reduce memory overhead, flexible bit-width support for diverse AI models, and optimized instruction sets for better integration with modern AI frameworks.
By adopting these innovations, the team achieved a 6.93x increase in inference speed while using just 38.3% of the area of a traditional Tensor Core. Additionally, the LUT-based approach resulted in a 20.9x boost in computational density and an 11.2x improvement in energy efficiency.
Microsoft has made T-MAC and Ladder open source, inviting researchers and developers to experiment with these technologies and further push the boundaries of AI on edge devices. These advancements could help usher in a new era where powerful AI runs on everyday devices, bringing intelligence closer to where it is needed most.