Is This the Polaroid of Generative AI?

Using TensorRT, Stable Diffusion XL can generate images nearly twice as fast without sacrificing image quality.

Nick Bild
AI & Machine Learning
Comparing different model quantization methods (📷: NVIDIA)

Unlike traditional computer programming, where the logic is clearly defined by humans, machine learning models instead learn to perform their function through a training process in which they make observations. As they examine large datasets and work to align inputs with expected outputs, a network of connections of varying strengths is built up between nodes. Exactly how a model arrives at its final state, and how that state correlates with the function it performs, can be difficult to fully understand.

But one thing that has become clear is that since not every bit of these models is carefully designed by engineers seeking an optimal solution, the end result can be on the clumsy side. For this reason, many efforts have been undertaken to streamline machine learning algorithms after the training process has been completed. These efforts have tended to focus on pruning segments of the model away, or quantizing its weights, such that it becomes smaller. The result is a new algorithm that performs essentially the same function, yet runs faster and requires fewer computational resources — that is, if these steps do not reduce the algorithm’s performance unacceptably, of course.

When it comes to diffusion models, of the sort that power popular image generation tools like Stable Diffusion, these tricks do not work so well. Because diffusion models remove noise over many steps, and because the amount of noise can change significantly from one step to the next, applying a simple quantization method is difficult — a single set of quantization parameters rarely fits the activation ranges seen at every step.
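To get a feel for why this is a problem, consider a toy sketch (all magnitudes here are illustrative, not measured from any real model): if one static int8 range must cover the wide activations of the noisiest early steps, the nearly noise-free final steps lose almost all of their resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one activation tensor observed at several denoising
# steps: early (high-noise) steps produce much wider value ranges than
# late (low-noise) steps. Purely illustrative numbers.
step_scales = [10.0, 5.0, 1.0, 0.2]
acts_per_step = [rng.normal(0.0, s, size=5000) for s in step_scales]

def quantize_int8(x, clip):
    """Symmetric int8 fake-quantization over the range [-clip, clip]."""
    scale = clip / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

# A single static range must cover the noisiest step...
clip_static = max(np.abs(a).max() for a in acts_per_step)

# ...so at the final, low-noise step most values collapse into just a
# few quantization levels, while a per-step range preserves detail.
last = acts_per_step[-1]
err_static = np.mean((last - quantize_int8(last, clip_static)) ** 2)
err_per_step = np.mean((last - quantize_int8(last, np.abs(last).max())) ** 2)
```

Here `err_per_step` comes out orders of magnitude smaller than `err_static`, which is the gap a step-aware quantization scheme is trying to close.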

There are existing techniques, like SmoothQuant, that shift the quantization challenge from activations to weights through a mathematically equivalent transformation in order to maintain accuracy. Despite the effectiveness of this approach, a team at NVIDIA noticed that it can be very difficult to use. A number of parameters must be manually defined, for example. Furthermore, SmoothQuant struggles when faced with diverse image characteristics, and only works with one particular type of diffusion model.
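The core SmoothQuant idea can be sketched in a few lines of numpy: a linear layer Y = XW is rewritten as Y = (X / s)(diag(s) W), which is exactly equivalent, but with per-channel smoothing factors s chosen so that outlier activation channels shrink and the difficulty migrates into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer Y = X @ W, with one activation channel that has much
# larger magnitudes than the rest -- the outlier case SmoothQuant targets.
X = rng.normal(size=(8, 4))
X[:, 0] *= 50.0          # "outlier" activation channel
W = rng.normal(size=(4, 3))

# Per-channel smoothing factors, using the SmoothQuant recipe with
# migration strength alpha = 0.5:
#   s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Mathematically equivalent factorization: Y = (X / s) @ (diag(s) @ W).
# Activation outliers shrink; the weights absorb the dynamic range and
# are easier to quantize accurately offline.
X_smooth = X / s
W_smooth = s[:, None] * W

assert np.allclose(X @ W, X_smooth @ W_smooth)
```

The parameter that must be hand-tuned here — the migration strength `alpha` — is exactly the kind of knob the article says makes SmoothQuant fiddly to apply in practice.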

For this reason, the team built a new feature into their TensorRT library, which is designed to optimize the inference performance of large models. Using this new feature, a tuning pipeline can automatically determine the optimal parameter settings to use with SmoothQuant. A new technique, called Percentile Quant, was also introduced. It ensures that the quantization is tailored to the specific needs of the image denoising process. Furthermore, TensorRT provides a generalized solution that is applicable to more types of models, and fortunately, it is also much easier to implement than a custom solution.
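The source does not detail how Percentile Quant works internally, but percentile-based calibration in general is easy to illustrate: instead of letting the single largest activation define the int8 range, the range is clipped at a high percentile (99.9 is used below purely as an example) so the bulk of the distribution gets a much finer quantization grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy activation tensor: mostly well-behaved values plus a few extreme
# outliers. The specific values are illustrative only.
bulk = rng.normal(0.0, 1.0, size=10_000)
outliers = np.array([80.0, -95.0, 120.0])
acts = np.concatenate([bulk, outliers])

def quantize_int8(x, clip):
    """Symmetric int8 fake-quantization over the range [-clip, clip]."""
    scale = clip / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

# Max calibration: the outliers dictate the range, so the 256 int8
# levels are spread across [-120, 120] and typical values lose
# nearly all resolution.
clip_max = np.abs(acts).max()

# Percentile calibration: clip at the 99.9th percentile of |activations|
# so typical values land on a far finer quantization grid.
clip_pct = np.percentile(np.abs(acts), 99.9)

# Compare reconstruction error on the bulk of the distribution.
bulk_err_max = np.mean((bulk - quantize_int8(bulk, clip_max)) ** 2)
bulk_err_pct = np.mean((bulk - quantize_int8(bulk, clip_pct)) ** 2)
```

The trade-off is explicit: the rare outliers get clipped under percentile calibration, but the overwhelming majority of values are represented far more faithfully.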

With TensorRT, Stable Diffusion XL was shown to generate images nearly twice as fast. And judging by the examples presented, it does not look like image quality was sacrificed to achieve that speed-up.

If you have an NVIDIA GPU handy and want to try it out for yourself, this blog post contains step-by-step directions to get you up and running quickly. Source code is also available in this GitHub repository.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.