Making the Grade Without a Huge Upgrade

MLP-Offload is a new tool that helps modest GPUs train and run cutting-edge generative AI models without giving up much performance.

Nick Bild

The best GPUs cost as much as a car these days, so if you are experimenting with AI at home and are not the sort who uses hundred-dollar bills for kindling, you are going to have to make some compromises. This is especially true when working in the area of generative AI, where models with hundreds of billions of parameters are the norm. Before training or running inference, that entire model has to be loaded into GPU memory, so you can’t skimp too much on hardware.

Most of us have to compromise, however, or we will be left out of the game entirely. So to make do with hardware that isn’t up to snuff, a variety of disk offloading techniques have been developed. That may sound fancy, but it is essentially the same old disk swapping we have all known (and hated) forever. If you care at all about performance, this I/O bottleneck is one you want to avoid at all costs.

But given the costs, we may not be able to avoid it. Fortunately, a team led by researchers at the Argonne National Laboratory has come up with a novel way to make the best of a bad situation. They have developed a system called MLP-Offload that offloads the contents of GPU memory to other locations (e.g., caches, system memory, or disk), but it does so with intelligence. Through careful optimization, the performance hit of this memory offloading is minimized.
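To get a feel for what offloading means here, the sketch below shows the general idea in PyTorch: move a tensor out of GPU memory to the host and on to disk, then pull it back only when it is needed again. This is just an illustration of the concept, not MLP-Offload’s actual API; the function names and file path are my own.

    import torch

    def offload_to_disk(tensor: torch.Tensor, path: str) -> None:
        # Stage the GPU tensor in host memory, then persist it to disk.
        torch.save(tensor.detach().to("cpu"), path)

    def reload_from_disk(path: str, device: str = "cuda") -> torch.Tensor:
        # Read the tensor back and move it onto the GPU only when needed.
        return torch.load(path).to(device)

    weights = torch.randn(4096, 4096, device="cuda")
    offload_to_disk(weights, "/tmp/layer0.pt")    # frees up GPU memory
    weights = reload_from_disk("/tmp/layer0.pt")  # brought back later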

The researchers observed that much of the potential bandwidth in modern high-performance computing setups — such as parallel file systems and object stores — typically goes completely unused during training. By harnessing this capacity alongside local NVMe drives, MLP-Offload increases the available bandwidth. It also applies concurrency controls to reduce contention between multiple GPUs that are all trying to read and write at once. Add to that smarter caching strategies, reusing data already staged in system memory rather than shuttling it back and forth unnecessarily, and the system manages to claw back a surprising amount of performance.
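As a rough illustration of two of those ideas (and nothing more), the sketch below fans offloaded shards out across a local NVMe path and a parallel file system path so both tiers’ bandwidth gets used, caps concurrent writers per tier with a semaphore, and serves reads from a host-memory cache when the data is already staged there. The paths, limits, and helper names are assumptions for illustration, not code or identifiers taken from MLP-Offload itself.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical storage tiers, each with its own concurrency cap.
    TIERS = {
        "/mnt/nvme/offload": threading.Semaphore(4),  # local NVMe
        "/lustre/offload": threading.Semaphore(2),    # parallel file system
    }
    host_cache = {}  # data already staged in system memory

    def write_shard(tier: str, name: str, data: bytes) -> None:
        # The semaphore keeps too many writers from piling onto one tier.
        with TIERS[tier]:
            with open(f"{tier}/{name}.bin", "wb") as f:
                f.write(data)
        host_cache[name] = data  # keep a host-memory copy for cheap reuse

    def read_shard(tier: str, name: str) -> bytes:
        # Reuse the cached copy instead of going back out to disk.
        if name in host_cache:
            return host_cache[name]
        with open(f"{tier}/{name}.bin", "rb") as f:
            return f.read()

    def offload_all(shards: dict) -> None:
        # Round-robin shards across tiers so their bandwidth adds up.
        paths = list(TIERS)
        with ThreadPoolExecutor(max_workers=8) as pool:
            for i, (name, data) in enumerate(shards.items()):
                pool.submit(write_shard, paths[i % len(paths)], name, data)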

In experiments training models up to 280 billion parameters on clusters of NVIDIA A100 GPUs, MLP-Offload delivered a 2.5x overall speedup compared with existing offloading frameworks like DeepSpeed. The backward pass and parameter update phases, which are traditionally the most I/O-bound, saw the largest improvements — up to 13.5 times faster in some cases.

For those without access to the latest and greatest GPUs, approaches like this may make the difference between being shut out of cutting-edge AI entirely and being able to meaningfully participate. This problem isn’t going away any time soon, but if solutions like MLP-Offload continue to mature, we may at least be able to scale up without completely breaking the bank.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.