Cache Me If You Can

ARCANE boosts AI performance by computing directly in cache memory to reduce data movement and deliver up to 305x speedups over CPUs alone.

Today’s predominant computing architectures were not designed with artificial intelligence (AI) in mind. The massive amount of data that has to be transferred between memory and processing units to train a large AI model will cause traditional computing systems to run slower than molasses in January. But AI is a transformative technology that is here to stay, so we have to find ways to make things work without going back to the drawing board on computer design. For this reason, all sorts of AI accelerators, such as GPUs, TPUs, and VPUs, have been developed to give existing computers a speed boost.

But while these accelerators can and do massively speed up AI workloads, data still has to be moved between memory and the accelerator to varying extents. As such, each hardware option comes with its own set of tradeoffs, and none is ideal for every use case. The perfect solution might involve in-memory computing, but in practice, these systems tend to lack flexibility and scalability because of the specialized technologies they require. For a fast-paced and growing field like AI, these compromises are often unacceptable.

A block diagram of ARCANE (📷: V. Petrolo et al.)

Researchers at the Polytechnic University of Turin and the Swiss Federal Institute of Technology in Lausanne (EPFL) recently highlighted another option, called near-memory computing (NMC), that may be appropriate for a wider range of AI workloads. Because they leverage standard digital design flows, NMC systems offer a more scalable and practical solution in many cases. In particular, the team dug into an NMC-based, cache-integrated computing architecture known as ARCANE to see how much of a boost it can provide over CPU-only systems.

ARCANE integrates Vector Processing Units (VPUs) directly into the data cache of a computing system. This approach significantly cuts down on the time and energy wasted moving data back and forth between processors and memory. It does so through a custom instruction set extension called xmnmc, which simplifies memory management and enables machine learning kernels to run directly within the cache.

This unique in-cache computing paradigm avoids the memory bottlenecks that plague traditional von Neumann architectures. Instead of sending data on a long round trip to memory and back, ARCANE keeps operations local by locking a portion of the cache during execution and handling operand transfers with a lightweight software-controlled direct memory access scheme.
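
To make that flow a little more concrete, here is a minimal sketch in plain C of the sequence described above: stage operands into a locked cache region, run the kernel on cache-resident data, then stage the result back out. The names cache_scratchpad, nmc_stage_in, nmc_stage_out, and nmc_matmul are hypothetical placeholders invented for illustration; they are not the actual xmnmc instructions or the ARCANE API.

```c
/* Conceptual sketch of ARCANE-style in-cache execution (not the real API). */
#include <stdio.h>
#include <string.h>

#define N 4                          /* small square matrices for the demo   */
#define SCRATCH_WORDS (3 * N * N)    /* room for A, B, and C in the "cache"  */

/* Stands in for the portion of the cache that gets locked during execution. */
static float cache_scratchpad[SCRATCH_WORDS];

/* Models the lightweight, software-controlled DMA moving operands into cache. */
static void nmc_stage_in(float *dst, const float *src, size_t words) {
    memcpy(dst, src, words * sizeof(float));
}

/* Models copying results back to main memory once the kernel is finished. */
static void nmc_stage_out(float *dst, const float *src, size_t words) {
    memcpy(dst, src, words * sizeof(float));
}

/* The kernel touches only cache-resident operands, which is the property
 * that removes the round trips to main memory. */
static void nmc_matmul(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

int main(void) {
    float a[N * N], b[N * N], c[N * N];

    /* Fill A with ones and B with the identity so the result is easy to check. */
    for (int i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = (i / N == i % N) ? 1.0f : 0.0f;
    }

    /* 1. Stage operands into the locked cache region. */
    float *ca = cache_scratchpad;
    float *cb = cache_scratchpad + N * N;
    float *cc = cache_scratchpad + 2 * N * N;
    nmc_stage_in(ca, a, N * N);
    nmc_stage_in(cb, b, N * N);

    /* 2. Run the kernel entirely on cache-resident data. */
    nmc_matmul(ca, cb, cc, N);

    /* 3. Stage the result back out to main memory. */
    nmc_stage_out(c, cc, N * N);

    printf("c[0][0] = %.1f (expect 1.0)\n", c[0]);
    return 0;
}
```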

An illustration of a matrix multiplication on an ARCANE VPU (📷: V. Petrolo et al.)

In a series of experiments, ARCANE delivered up to a 150x speedup on 2D convolution, a key operation in many computer vision models. For linear layers, which are fundamental building blocks of neural networks, it achieved a 305x improvement. And on Transformer-style operations like Fused-Weight Self-Attention, which are common in language models, it delivered a 32x acceleration.

In the fast-moving field of AI, it doesn’t hurt to have another tool in your toolbox. ARCANE might be just what you need to keep your latest project idea from stalling out.

nickbild

