NVIDIA's TensorRT Edge-LLM is a high-performance C++ inference runtime designed to run Large Language Models (LLMs) and Vision-Language Models (VLMs) on edge platforms such as the NVIDIA Jetson Thor.
In this guide, we will walk through the complete pipeline for deploying Qwen3-TTS (a text-to-speech model) and Nemotron-3-Nano-4B using TensorRT Edge-LLM. The workflow spans two machines: an x86 workstation with an NVIDIA GPU for model export, and an NVIDIA Jetson Thor for engine building and inference.
By the end of this tutorial, you will have a fully working pipeline that converts HuggingFace models into optimized TensorRT engines to generate speech audio and text directly on the edge.
What's New in v0.6.0 of NVIDIA's TensorRT Edge-LLM?
This tutorial uses TensorRT Edge-LLM v0.6.0, released on March 16, 2026. This third public release adds end-to-end support for:
- Qwen3-TTS and Qwen3-ASR
- Nemotron-Nano-9B-v2
- Day-0 support for Nemotron-3-Nano-4B
- Qwen3-30B-A3B MoE (via an INT4 MoE Plugin)
Featured at GTC 2026, the flagship demo showcased ASR, LLM, and TTS running simultaneously on an NVIDIA Jetson AGX Thor, enabling real-time multimodal interaction for autonomous vehicles and robotics.
Qwen3-TTS is a text-to-speech model from Alibaba's Qwen team. It uses a discrete multi-codebook LM architecture that enables end-to-end speech modeling, bypassing the information bottlenecks of traditional LM+DiT schemes. The model covers 10 major languages and features intelligent text understanding for adaptive control of tone, speaking rate, and emotional expression.
It consists of three core components:
- Talker: The main LLM that processes text input and generates codec tokens.
- CodePredictor: A secondary LLM that predicts residual vector quantization (RVQ) codes.
- Code2Wav (tokenizer_decoder): A vocoder that converts codec tokens into audio waveforms.
Note: The supported model for this guide is Qwen3-TTS-12Hz-1.7B-CustomVoice. All models support FP16 precision.
Prerequisites
Before starting, ensure you have the following hardware and software ready for both machines.
💻 x86 Host Machine (For Model Export)
Quantization is compute-heavy and designed for x86 with full CUDA/cuBLAS support.
- OS: x86-64 Linux system (Ubuntu 22.04 or 24.04 recommended)
- GPU: NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)
- Software: CUDA 12.x or 13.x, Python 3.10+
- Environment: Docker (recommended) or a Python virtual environment
Jetson Thor Device (For Engine Build and Inference)
- Hardware: NVIDIA Jetson AGX Thor Developer Kit
- OS: JetPack 7.1
- Software: CUDA 13.x and TensorRT 10.x+ (included in JetPack)
- Storage: 20–50 GB disk space for ONNX files and TensorRT engines
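The version floors above (Compute Capability 8.0+, Python 3.10+) are easy to misjudge by eye, since "3.10" sorts below "3.9" as a plain string. The following is a minimal sketch of a numeric comparison. `version_ge` is a hypothetical helper name, not part of TensorRT Edge-LLM, and the commented `nvidia-smi --query-gpu=compute_cap` query assumes a reasonably recent driver.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A >= B, comparing each field
# numerically (hypothetical helper, not part of TensorRT Edge-LLM).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      xi = (i <= n) ? x[i] + 0 : 0   # missing fields count as 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0                            # all fields equal
  }'
}

# Example use on the x86 host (commented out: needs a GPU and Python installed).
# cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
# version_ge "$cc" 8.0 || echo "GPU compute capability $cc is below 8.0"
# py=$(python3 -c 'import platform; print(platform.python_version())')
# version_ge "$py" 3.10 || echo "Python $py is older than 3.10"
```

Comparing field by field keeps the check independent of any particular `sort` implementation.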
Part 1: Exporting the Model on the x86 Host
The Python export pipeline converts the Qwen3-TTS model components into ONNX format. This step runs on your x86 Linux workstation.
Note: For this tutorial, I'm using a workstation equipped with an NVIDIA GeForce RTX 5090.
Step 1.1: Verify Your GPU and CUDA Installation
Confirm that your GPU is properly detected and CUDA is working:
nvidia-smi
The output should list your GPU along with the driver and CUDA versions.
Check the CUDA installation (it should show CUDA 12.x or 13.x):
nvcc --version
Output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0
If CUDA is not installed, download and install the CUDA Toolkit from the NVIDIA CUDA Downloads page.
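If you want to script this check rather than read the banner by eye, the release number can be pulled out of the `nvcc --version` text. This is a minimal sketch; `cuda_release` and `cuda_supported` are hypothetical helper names, and the pattern assumes the "release X.Y" wording shown in the output above.

```shell
#!/bin/sh
# cuda_release: read `nvcc --version` output on stdin and print the release,
# e.g. "13.2". Relies on the "release X.Y" wording in the banner.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n1
}

# cuda_supported RELEASE: succeeds for the 12.x and 13.x releases this guide lists.
cuda_supported() {
  case "$1" in
    12.*|13.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (commented out: requires the CUDA Toolkit on PATH).
# rel=$(nvcc --version | cuda_release)
# if cuda_supported "$rel"; then echo "CUDA $rel OK"; else echo "CUDA '$rel' is outside 12.x/13.x"; fi
```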
Step 1.2: Clone the Repository
Clone the TensorRT Edge-LLM repository and initialize submodules:
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
Step 1.3: Install TTS-Specific Dependencies
The Qwen3-TTS export pipeline requires the qwen-tts package, which has dependency conflicts with the standard environment. You must use a dedicated virtual environment for Qwen3-TTS export until Qwen3-TTS support is merged into HuggingFace Transformers.
# Run from the repository root (skip this cd if you are still inside it from Step 1.2)
cd TensorRT-Edge-LLM
uv venv .qwen3-tts --python 3.12
source .qwen3-tts/bin/activate
uv pip install qwen-tts
uv pip install .
uv pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Step 1.4: Download the Model
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--local-dir ./Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Step 1.5: Export the Model Components
Qwen3-TTS has three components that must be exported separately. First, set your environment variables:
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX_OUTPUT_DIR=$WORKSPACE_DIR/$TTS_MODEL/onnx
export TTS_CHAT_TEMPLATE=./tensorrt_edgellm/chat_templates/templates/qwen3tts.json
Export the Talker and CodePredictor (LLM components):
tensorrt-edgellm-export-llm \
--model_dir Qwen/$TTS_MODEL \
--output_dir $ONNX_OUTPUT_DIR \
--chat_template $TTS_CHAT_TEMPLATE \
--export_models talker,code_predictor
Export the Code2Wav vocoder:
tensorrt-edgellm-export-audio \
--model_dir Qwen/$TTS_MODEL \
--output_dir $ONNX_OUTPUT_DIR \
--export_models tokenizer_decoder
Step 1.6: Transfer the ONNX Model to Jetson Thor
Once the export is complete, transfer the output folder to your Jetson Thor device via SCP:
scp -r $ONNX_OUTPUT_DIR \
<user>@<device>:~/tensorrt-edgellm-workspace/$TTS_MODEL/
(Replace <user> with your Jetson username and <device> with the device's IP address.)
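Before or after the transfer, it can help to confirm that the export produced a folder for each of the three components. The sketch below checks for the sub-folders that the engine build commands in Part 3 read from (`llm/talker`, `llm/code_predictor`, `audio/tokenizer_decoder`). That layout is inferred from those commands rather than from exporter documentation, and `check_onnx_layout` is a hypothetical helper name.

```shell
#!/bin/sh
# check_onnx_layout DIR: report which expected component folders are missing.
# The folder names mirror the --onnxDir paths used in Part 3 of this guide;
# treat them as an assumption about the exporter's output layout.
check_onnx_layout() {
  root=$1
  missing=0
  for sub in llm/talker llm/code_predictor audio/tokenizer_decoder; do
    if [ ! -d "$root/$sub" ]; then
      echo "missing: $root/$sub" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# check_onnx_layout "$ONNX_OUTPUT_DIR" && echo "export layout looks complete"
```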
Part 2: Building the C++ Runtime on Jetson Thor
Switch over to your Jetson Thor device. The C++ runtime builds TensorRT engines from ONNX files and runs inference natively without Python dependencies.
Step 2.1: Install System Dependencies & Verify
Install the required build tools and verify your JetPack 7.1 installation:
sudo apt update
sudo apt install -y cmake build-essential git
# Verify CUDA 13.x
nvcc --version
# Verify TensorRT 10.x+
dpkg -l | grep tensorrt
Step 2.2: Clone and Build
cd ~
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
mkdir build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DTRT_PACKAGE_DIR=/usr \
-DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
-DEMBEDDED_TARGET=jetson-thor
# Build time is approximately 1–2 minutes
make -j$(nproc)
Verify the build was successful:
./examples/llm/llm_build --help
./examples/llm/llm_inference --help
Part 3: Building Engines and Running TTS Inference
Step 3.1: Set Environment Variables
cd ~/TensorRT-Edge-LLM
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX=$WORKSPACE_DIR/$TTS_MODEL/onnx
export ENG=$WORKSPACE_DIR/$TTS_MODEL/engines
Step 3.2: Build the TensorRT Engines
Three engine builds are required.
Important: --maxBatchSize must be set to 1, as the Qwen3-TTS ONNX export uses a fixed batch size.
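This guide carries two batch-size rules: 1 for Qwen3-TTS here, and 2 for Nemotron-3-Nano-4B in the later section. If you script the builds, one way to avoid mixing them up is to record the rules in a single place. The sketch below does that; `required_batch_size` is a hypothetical helper, and the name patterns simply match the model names used in this tutorial.

```shell
#!/bin/sh
# required_batch_size MODEL: print the --maxBatchSize value this guide calls
# for, or fail when no rule is recorded for MODEL (hypothetical helper).
required_batch_size() {
  case "$1" in
    Qwen3-TTS-*)          echo 1 ;;   # ONNX export uses a fixed batch size
    *Nemotron-3-Nano-4B*) echo 2 ;;   # batch size 1 gives degenerate output
    *)                    return 1 ;; # unknown model: no rule recorded here
  esac
}

# Usage (commented out: needs the built runtime and exported ONNX files).
# bs=$(required_batch_size "$TTS_MODEL") || { echo "no rule for $TTS_MODEL" >&2; exit 1; }
# ./build/examples/llm/llm_build --onnxDir $ONNX/llm/talker --engineDir $ENG/talker \
#   --maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize "$bs"
```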
1. Talker LLM engine:
./build/examples/llm/llm_build \
--onnxDir $ONNX/llm/talker \
--engineDir $ENG/talker \
--maxInputLen 4096 \
--maxKVCacheCapacity 4096 \
--maxBatchSize 1
2. CodePredictor LLM engine:
./build/examples/llm/llm_build \
--onnxDir $ONNX/llm/code_predictor \
--engineDir $ENG/code_predictor \
--maxInputLen 4096 \
--maxKVCacheCapacity 4096 \
--maxBatchSize 1
3. Code2Wav engine:
./build/examples/multimodal/audio_build \
--onnxDir $ONNX/audio/tokenizer_decoder \
--engineDir $ENG/code2wav
Step 3.3: Copy Tokenizer and Chat Template Files
cp $ONNX/llm/*.json $ENG/
Step 3.4: Create an Input File & Run Inference
Create your input JSON file. Built-in speakers include: ryan, serena, aiden, vivian, dylan, eric, uncle_fu, ono_anna, and sohee.
cat > input.json << 'EOF'
{
"talker_temperature": 0.9,
"talker_top_k": 50,
"repetition_penalty": 1.05,
"speaker": "ryan",
"requests": [
{
"messages": [{"role": "assistant", "content": "Hello, how can I help you today?"}]
},
{
"speaker": "serena",
"messages": [{"role": "assistant", "content": "The weather is sunny and warm."}]
}
]
}
EOF
Run the inference:
./build/examples/omni/qwen3_tts_inference \
--talkerEngineDir $ENG/talker \
--code2wavEngineDir $ENG/code2wav \
--tokenizerDir $ENG \
--inputFile input.json \
--outputFile output.json \
--outputAudioDir ./audio_output
Generated .wav files and metadata will be saved to your output directory!
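Since the speaker field is expected to name one of the built-in voices listed in Step 3.4, a quick pre-flight check can catch typos before inference is launched. This is a minimal sketch: both function names are hypothetical, and `speakers_in_file` is a plain-text scan for "speaker" keys, not a real JSON parser, so it only suits simple input files like the one above.

```shell
#!/bin/sh
# is_builtin_speaker NAME: succeeds when NAME is one of the built-in voices
# listed in this guide.
is_builtin_speaker() {
  case "$1" in
    ryan|serena|aiden|vivian|dylan|eric|uncle_fu|ono_anna|sohee) return 0 ;;
    *) return 1 ;;
  esac
}

# speakers_in_file FILE: print every "speaker" value found in FILE, one per line.
# Text scan only; it does not understand nesting or escaped quotes.
speakers_in_file() {
  grep -o '"speaker"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
    sed 's/.*:[[:space:]]*"\([^"]*\)"/\1/'
}

# Usage:
# for s in $(speakers_in_file input.json); do
#   is_builtin_speaker "$s" || echo "unknown speaker: $s" >&2
# done
```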
Deploying Nemotron-3-Nano-4B follows a similar pipeline, but requires strict adherence to precision and batch size rules.
CRITICAL CONSTRAINTS FOR NEMOTRON-4B:
- Precision: Export MUST use BF16 (no quantization). FP8 quantization breaks Mamba SSM layers during TensorRT compilation.
- Batch Size: Engine MUST be built with --maxBatchSize=2. Using batch size 1 causes garbage/degenerate output due to a known limitation.
Part 1: Export on x86 Host (BF16)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=NVIDIA-Nemotron-3-Nano-4B-BF16
mkdir -p $WORKSPACE_DIR/$MODEL_NAME
tensorrt-edgellm-export-llm \
--model_dir nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--output_dir $WORKSPACE_DIR/$MODEL_NAME/onnx-bf16
Transfer these files to the Jetson device via SCP as done previously.
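For reference, the copy could mirror the earlier Qwen3-TTS transfer. The sketch below only prints the `scp` command so you can review it before running it. `nemotron_scp_cmd` is a hypothetical helper, and the user, host, and destination directory are placeholders you supply. The destination should be an existing directory on the Jetson that matches the paths you use in the build step.

```shell
#!/bin/sh
# nemotron_scp_cmd USER HOST DEST_DIR: print (not run) an scp command that
# copies the exported onnx-bf16 folder into DEST_DIR on the Jetson.
# Reads WORKSPACE_DIR and MODEL_NAME as set in the export step.
nemotron_scp_cmd() {
  printf 'scp -r %s %s@%s:%s/\n' \
    "$WORKSPACE_DIR/$MODEL_NAME/onnx-bf16" "$1" "$2" "$3"
}

# Usage (placeholders are illustrative):
# nemotron_scp_cmd <user> <device> <destination directory on the Jetson>
```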
Part 2: Build Engine & Infer on Jetson Thor
SSH into your Jetson and run the build command, ensuring the batch size is set correctly:
# Adjust these paths to wherever you copied the ONNX files on your Jetson
export ONNX=/home/jetson/Projects/TensorRT-Edge-LLM/nemotron-4b/onnx-bf16
export ENG=/home/jetson/Projects/TensorRT-Edge-LLM/nemotron-4b/engines-bf16-bs2
cd ~/Projects/TensorRT-Edge-LLM
./build/examples/llm/llm_build \
--onnxDir $ONNX \
--engineDir $ENG \
--maxBatchSize 2 \
--maxInputLen 4096 \
--maxKVCacheCapacity 4096
Create a simple test input:
cat > /tmp/nemotron_input.json << 'EOF'
{
"temperature": 0.6,
"top_p": 0.9,
"top_k": 50,
"max_generate_length": 512,
"requests": [
{
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}
]
}
EOF
Run the inference:
./build/examples/llm/llm_inference \
--engineDir $ENG \
--inputFile /tmp/nemotron_input.json \
--outputFile /tmp/nemotron_output.json
View the output:
cat /tmp/nemotron_output.json
You should see:
{
"responses": [
{
"output_text": "The capital of France is Paris.<|im_end|>",
"request_idx": 0
}
]
}
For more details and additional examples (VLM, ASR, speculative decoding), see the full TensorRT Edge-LLM documentation and the GitHub repository.
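One practical note on the sample response: its output_text ends with the <|im_end|> end-of-turn token. Whether the runtime can omit this token is not covered in this guide, so the sketch below simply strips it in post-processing. `strip_end_token` is a hypothetical helper name.

```shell
#!/bin/sh
# strip_end_token TEXT: print TEXT with a trailing <|im_end|> marker removed,
# leaving the text unchanged when the marker is absent.
strip_end_token() {
  printf '%s\n' "$1" | sed 's/<|im_end|>$//'
}

# Usage:
# strip_end_token 'The capital of France is Paris.<|im_end|>'
```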

