NVIDIA Triples LLM Inference Performance with TensorRT-LLM's "Speculative Decoding" Trick
Running a small, fast draft model ahead of a cutting-edge large model delivers a big performance gain, NVIDIA claims.
NVIDIA engineers have highlighted a feature of the company's TensorRT-LLM tool that, they say, can provide a tripling of performance for the inference stage of large language model (LLM) operation — using speculative decoding, now available on single-GPU and single-node multi-GPU platforms.
"TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs," a joint post from NVIDIA team members including Carl "Izzy" Putterman explains. "Speculative decoding, also referred to as speculative sampling, works by paying a small additional computation cost to speculatively generate the next several tokens and then using the target model to perform a built-in verification step to ensure the quality of output generation while giving a throughput boost."
LLMs are the focus of considerable interest thanks to their ability to generate text, images, video, and even audio from input prompts, using what is, effectively, a token-based autocomplete system built through training on a vast corpus of data. There are quality and ethical concerns: the former typically center on the technology's tendency to "hallucinate" incorrect or impossible responses, the latter on its growing energy demands and the unauthorized hoovering up of copyrighted content to be chewed up and spat back out by the models. Even so, the technology is undeniably popular, with companies from Anthropic and OpenAI to Google and Meta constantly releasing new and refined models to one-up the competition.
Like any machine learning system, LLMs require vast computational resources during the training stage, but also at the point of use, during the inference stage. It's here, NVIDIA says, that TensorRT-LLM's speculative decoding feature can deliver improvements, offering a tripling or better in performance, measured in output tokens per second, for popular LLMs. The trick: running two models in tandem, a lightweight, high-performance "draft" model and a larger, slower "target" model, with the draft model cheaply speculating the next several tokens and the target model then verifying them, keeping those that pass its check and discarding the rest.
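To make the draft-and-verify loop concrete, the sketch below walks through the idea in plain Python. It is a toy illustration rather than NVIDIA's TensorRT-LLM code: the stand-in scoring functions, the draft length, and the greedy accept-or-stop rule are all assumptions made for the example, and a production engine verifies the drafted tokens in a single batched forward pass of the target model rather than one position at a time.

```python
# Toy greedy speculative-decoding loop over a tiny integer "vocabulary".
# NOT NVIDIA's implementation: the models, draft length K, and token rules
# below are illustrative stand-ins only.

K = 4  # number of tokens the draft model proposes per round (assumption)

def draft_next(context):
    """Cheap stand-in for the small, fast draft model (greedy)."""
    return (sum(context) + 1) % 10          # toy next-token rule

def target_next(context):
    """Stand-in for the large, slow target model (greedy)."""
    s = sum(context)
    return (s + 1) % 10 if s % 7 else (s + 2) % 10  # occasionally disagrees

def speculative_step(context):
    # 1. Draft model speculates K tokens ahead, one cheap step at a time.
    proposed, scratch = [], list(context)
    for _ in range(K):
        tok = draft_next(scratch)
        proposed.append(tok)
        scratch.append(tok)

    # 2. Target model verifies the proposals. (Here we loop for clarity; a
    #    real engine scores all K positions in one batched forward pass,
    #    which is where the throughput gain comes from.)
    accepted, scratch = [], list(context)
    for tok in proposed:
        expected = target_next(scratch)
        if tok == expected:
            accepted.append(tok)            # draft guess matches: keep it
            scratch.append(tok)
        else:
            accepted.append(expected)       # first mismatch: emit the target's
            break                           # own token and end this round
    else:
        accepted.append(target_next(scratch))  # all K accepted: one bonus token
    return accepted

context = [3, 1, 4]
for _ in range(3):
    new_tokens = speculative_step(context)
    context.extend(new_tokens)
    print(f"accepted {len(new_tokens)} token(s): {new_tokens}")
```

Note that each round emits at least one token, the target model's own choice at the first mismatch, so in this greedy form the output matches what the target model would have produced on its own; only the latency changes.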
"As long as the draft model is sufficiently faster than the target model while also maintaining a high enough acceptance rate," the team claims of the tool's operation, "the speculative sampling yields a lower end-to-end request latency by generating statistically more than one token per iteration."
More information is available on the NVIDIA Developer blog, along with a step-by-step tutorial on trying TensorRT-LLM for yourself, provided you have a compatible NVIDIA GPU or access to a suitable NVIDIA-powered cloud platform.