NVIDIA Demonstrates Better LLM AI Through LoRA-Tuned Models Optimized in TensorRT-LLM

Minimized-training approach to customizing LLMs aims to lower the barrier to entry while delivering multiple model variants on a single GPU.

Gareth Halfacree
4 months ago • Machine Learning & AI

NVIDIA engineer Amit Bleiweiss has penned a guide to working with low-rank adaptation (LoRA) for better large language models (LLMs) using the company's open source TensorRT-LLM to accelerate performance on compatible graphics processors.

"Customizing LLMs is a challenging task, often requiring a full training process that is time-consuming and computationally expensive. Moreover, training LLMs requires a diverse and representative dataset, which can be difficult to obtain and curate," Bleiweiss explains. "One promising solution is Low-Rank Adaptation (LoRA), a fine-tuning method that can significantly reduce the number of trainable parameters, the memory requirement, and the training time, while achieving comparable or even better performance than fine-tuning on various NLP [Natural Language Processing] tasks and domains."

Designed to make it easier to fine-tune LLMs without having to go through the whole training process again, LoRA adds low-rank matrices into the LLM and trains only those, leaving the original training weights as-is. There's another advantage to the approach, too: "By loading a single base model together with the low-rank matrices A and B for each respective LoRA tuned variant," Bleiweiss notes, "it's possible to store thousands of LLMs and run them dynamically and efficiently within a minimal GPU memory footprint."
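In essence, LoRA leaves a pretrained weight matrix W frozen and learns only a low-rank correction, computing W·x + (α/r)·B·A·x at inference time. The following is a minimal PyTorch-style sketch of that idea, not code from Bleiweiss' guide; the layer sizes, rank, and scaling factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen base weight, standing in for a pretrained LLM projection.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Low-rank factors A (down-projection) and B (up-projection); only these train.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha/r) * B A x: the frozen base path plus the low-rank delta.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")  # 65,536 of 16,842,752
```

At rank 8, the trainable A and B factors amount to well under one percent of the layer's parameters, which is where LoRA's savings in memory and training time come from.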

To demonstrate the concept, Bleiweiss' guide walks through using pre-tuned LLMs from the Hugging Face platform and optimizing them with NVIDIA's recently released open source TensorRT-LLM library, running both a single version of the model and two LoRA checkpoints to show that serving two variants does not, as you might expect, double the memory requirements. "With baseline support for many popular LLM architectures," he claims, "TensorRT-LLM makes it easy to deploy, experiment, and optimize with a variety of code LLMs."
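A back-of-envelope calculation shows why adding adapters is so cheap. The figures below are illustrative assumptions (an FP16 7B-parameter base model and rank-8 adapters on the attention projections of a 32-layer, 4,096-wide transformer), not measurements from the guide:

```python
# Illustrative memory comparison: one shared base model vs. per-variant LoRA weights.
base_params = 7e9        # assumed 7B-parameter base model
bytes_per_param = 2      # FP16

layers, hidden, rank = 32, 4096, 8
modules_per_layer = 4    # q, k, v, o projections, a common LoRA target set
adapter_params = layers * modules_per_layer * 2 * hidden * rank  # A and B per module

base_gb = base_params * bytes_per_param / 1e9
adapter_mb = adapter_params * bytes_per_param / 1e6
print(f"base model:   {base_gb:.1f} GB")    # ~14 GB, loaded once
print(f"one adapter:  {adapter_mb:.1f} MB") # ~16.8 MB per tuned variant
print(f"100 adapters: {100 * adapter_mb / 1e3:.2f} GB on top of the base")
```

Under these assumptions each additional tuned variant costs megabytes rather than gigabytes, which is what makes serving many LoRA checkpoints alongside a single base model practical on one GPU.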

The full guide is now available on the NVIDIA developer blog; those interested in trying it out will require a graphics card compatible with TensorRT-LLM, which means one based on the Volta, Turing, Ampere, Hopper, or Ada Lovelace architectures.
