The Lean Mean Bloat-Reducing AI Optimization Machine
NVIDIA's TensorRT Model Optimizer simplifies ML model optimization, using cutting-edge techniques to reduce model size and increase speed.
Machine learning is turning the traditional paradigm of how we program computers on its head. Rather than meticulously specifying exactly how a program should act under every condition in code, machine learning applications instead program themselves by learning from examples. This has proven to be hugely successful, giving us all sorts of tools that would otherwise be virtually impossible to create. I mean, can you even imagine specifying the logic necessary to recognize a cat in an image, let alone generate any image that a user asks for via a text prompt?
Today’s machine learning algorithms, especially the very large, cutting-edge ones, are built primarily for accuracy, with efficiency being of secondary importance. As a result, these models tend to be bloated, containing a lot of redundant and irrelevant information in their parameters. This is bad on a number of fronts — super-sized models require very expensive hardware and lots of energy for operation, which makes them less accessible and completely impractical for many use cases. They also take longer to run, which can make real-time applications impossible.
These are well-known problems, and a number of optimization techniques have been introduced in recent years that seek to reduce model bloat without hurting accuracy. Applying these techniques correctly can be challenging for many developers, however, so NVIDIA recently released a tool called the TensorRT Model Optimizer to simplify the process. The Model Optimizer contains a library of post-training and training-in-the-loop optimization techniques that slash model sizes and increase inference speeds.
One of the ways this goal is achieved is through the use of advanced quantization techniques. Algorithms such as INT8 SmoothQuant and Activation-aware Weight Quantization (AWQ) are available for model compression, in addition to more basic weight-only quantization methods. Quantization alone can substantially increase inference speeds, often with only a negligible drop in accuracy. The upcoming NVIDIA Blackwell platform, with its 4-bit floating point AI inference support, stands to reap major benefits from these techniques.
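To give a sense of what this looks like in practice, the sketch below shows how a post-training SmoothQuant pass might be applied through the library's Python API. The module path and config names follow the nvidia-modelopt package, but treat the details as assumptions that may shift between releases; the model and calibration data loader are placeholders.

```python
# A sketch of a post-training INT8 SmoothQuant pass with the Model Optimizer's Python API.
# The module path and config name reflect the nvidia-modelopt package; `model` and
# `calib_dataloader` are placeholders assumed to be defined elsewhere.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a small calibration set through the model so activation ranges can be measured.
    for batch in calib_dataloader:
        model(batch)

# Quantize weights and activations to INT8 using the SmoothQuant recipe; swapping in
# a config like mtq.INT4_AWQ_CFG would apply Activation-aware Weight Quantization instead.
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)
```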
The Model Optimizer can also further compress models with sparsity. By analyzing a model after it has been trained, these methods prune away weights that do not contribute to the model's output in any meaningful way. In one experiment, sparsity reduced the size of the Llama 2 70-billion-parameter large language model by 37 percent, and this huge reduction in size came with virtually no drop in accuracy.
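The sparsity workflow follows a similar pattern. The snippet below is a rough sketch based on the package's sparsity module; the mode string and config keys are assumptions that may not match every release, and the calibration data loader is again a placeholder.

```python
# A rough sketch of post-training sparsification with the Model Optimizer.
# The module path, mode string, and config keys are assumptions based on the
# nvidia-modelopt package and may differ between versions.
import modelopt.torch.sparsity as mts

# Prune low-importance weights in a structured pattern, using a small calibration
# set to decide which weights can be zeroed out with minimal impact on accuracy.
model = mts.sparsify(
    model,
    mode="sparsegpt",
    config={"data_loader": calib_dataloader, "collect_func": lambda batch: batch},
)
```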
As a part of the TensorRT framework, the Model Optimizer can be integrated into existing development and deployment pipelines. Getting started is as simple as issuing a “pip install” command, and NVIDIA has extensive documentation available to get developers up and running in no time.
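For anyone who wants to kick the tires, the install step looks something like this, assuming the nvidia-modelopt package name used in NVIDIA's documentation:

```shell
pip install nvidia-modelopt
```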