NVIDIA Demonstrates Better LLM AI Through LoRA-Tuned Models Optimized in TensorRT-LLM

Minimized-training approach to customizing LLMs aims to lower the barrier to entry while delivering multiple model variants on a single GPU.

Gareth Halfacree
4 months ago • Machine Learning & AI

NVIDIA engineer Amit Bleiweiss has penned a guide to working with low-rank adaptation (LoRA) for better large language models (LLMs) using the company's open source TensorRT-LLM to accelerate performance on compatible graphics processors.

"Customizing LLMs is a challenging task, often requiring a full training process that is time-consuming and computationally expensive. Moreover, training LLMs requires a diverse and representative dataset, which can be difficult to obtain and curate," Bleiweiss explains. "One promising solution is Low-Rank Adaptation (LoRA), a fine-tuning method that can significantly reduce the number of trainable parameters, the memory requirement, and the training time, while achieving comparable or even better performance than fine-tuning on various NLP [Natural Language Processing] tasks and domains."

Designed to make it easier to fine-tune LLMs without having to go through the whole training process again, LoRA adds low-rank matrices into the LLM and trains only those, leaving the original training weights as-is. There's another advantage to the approach, too: "By loading a single base model together with the low-rank matrices A and B for each respective LoRA tuned variant," Bleiweiss notes, "it's possible to store thousands of LLMs and run them dynamically and efficiently within a minimal GPU memory footprint."
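In essence, LoRA leaves a pretrained weight matrix W frozen and learns only a low-rank correction, computing W·x + (α/r)·B·A·x at inference time. The following is a minimal PyTorch-style sketch of that idea, not code from Bleiweiss' guide; the layer sizes, rank, and scaling factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen base weight, standing in for a pretrained LLM projection.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Low-rank factors A (down-projection) and B (up-projection); only these train.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha/r) * B A x: the frozen base path plus the low-rank delta.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")  # 65,536 of 16,842,752
```

At rank 8, the trainable A and B factors amount to well under one percent of the layer's parameters, which is where LoRA's savings in memory and training time come from.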

To demonstrate the concept, Bleiweiss' guide walks through using pre-tuned LLMs from the Hugging Face platform and optimizing them with NVIDIA's recently released open source TensorRT-LLM library, running both a single version of the model and two LoRA checkpoints to show that serving two variants does not, as you might expect, double the memory requirements. "With baseline support for many popular LLM architectures," he claims, "TensorRT-LLM makes it easy to deploy, experiment, and optimize with a variety of code LLMs."
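A back-of-envelope calculation shows why adding adapters is so cheap. The figures below are illustrative assumptions (an FP16 7B-parameter base model and rank-8 adapters on the attention projections of a 32-layer, 4,096-wide transformer), not measurements from the guide:

```python
# Illustrative memory comparison: one shared base model vs. per-variant LoRA weights.
base_params = 7e9        # assumed 7B-parameter base model
bytes_per_param = 2      # FP16

layers, hidden, rank = 32, 4096, 8
modules_per_layer = 4    # q, k, v, o projections, a common LoRA target set
adapter_params = layers * modules_per_layer * 2 * hidden * rank  # A and B per module

base_gb = base_params * bytes_per_param / 1e9
adapter_mb = adapter_params * bytes_per_param / 1e6
print(f"base model:   {base_gb:.1f} GB")    # ~14 GB, loaded once
print(f"one adapter:  {adapter_mb:.1f} MB") # ~16.8 MB per tuned variant
print(f"100 adapters: {100 * adapter_mb / 1e3:.2f} GB on top of the base")
```

Under these assumptions each additional tuned variant costs megabytes rather than gigabytes, which is what makes serving many LoRA checkpoints alongside a single base model practical on one GPU.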

The full guide is now available on the NVIDIA developer blog; those interested in trying it out will require a graphics card compatible with TensorRT-LLM, which means one based on the Volta, Turing, Ampere, Hopper, or Ada Lovelace architectures.
