What's New in v0.6.0 of NVIDIA's TensorRT Edge-LLM?
What is Qwen3-TTS?
Prerequisites
💻 x86 Host Machine (For Model Export)
⚡ Jetson Thor (For Engine Build & Inference)
Part 1: Setting Up the Python Export Pipeline on x86
Step 1.1: Verify Your GPU and CUDA Installation
Step 1.2: Clone the Repository
Step 1.3: Install TTS-Specific Dependencies
Step 1.4: Download the Model
Step 1.5: Export the Model Components
Step 1.6: Transfer the ONNX Model to Jetson Thor
Part 2: Building the C++ Runtime on Jetson Thor
Step 2.1: Install System Dependencies & Verify
Step 2.2: Clone and Build
Part 3: Building Engines and Running TTS Inference
Step 3.1: Set Environment Variables
Step 3.2: Build the TensorRT Engines
Step 3.3: Copy Tokenizer and Chat Template Files
Step 3.4: Create an Input File & Run Inference
Bonus: Running Nemotron-3-Nano-4B on Jetson Thor
Part 1: Export on x86 Host (BF16)
Part 2: Build Engine & Infer on Jetson Thor

Published March 23, 2026 © GPL3+

Getting Started with Nvidia TensorRT-Edge-LLM on Jetson Thor

Deploying Qwen3-TTS and Nemotron-3-Nano-4B on NVIDIA Jetson Thor with TensorRT Edge-LLM

Difficulty: Advanced. Full instructions provided. Estimated time: 3 hours.

Story

NVIDIA's TensorRT Edge-LLM is a high-performance C++ inference runtime designed to run Large Language Models (LLMs) and Vision-Language Models (VLMs) on edge platforms such as the NVIDIA Jetson Thor.

In this guide, we will walk through the complete pipeline for deploying Qwen3-TTS (a text-to-speech model) and Nemotron-3-Nano-4B using TensorRT Edge-LLM. The workflow spans two machines: an x86 workstation with an NVIDIA GPU for model export, and an NVIDIA Jetson Thor for engine building and inference.

By the end of this tutorial, you will have a fully working pipeline that converts HuggingFace models into optimized TensorRT engines to generate speech audio and text directly on the edge.

What's New in v0.6.0 of NVIDIA's TensorRT Edge-LLM?

This tutorial uses TensorRT Edge-LLM v0.6.0, released on March 16, 2026. This third public release adds end-to-end support for:

Qwen3-TTS and Qwen3-ASR
Nemotron-Nano-9B-v2
Day-0 support for Nemotron-3-Nano-4B
Qwen3-30B-A3B MoE (via an INT4 MoE Plugin)

The flagship demo at GTC 2026 ran ASR, LLM, and TTS models simultaneously on an NVIDIA Jetson AGX Thor, enabling real-time multimodal interaction for autonomous vehicles and robotics.

What is Qwen3-TTS?

Qwen3-TTS is a text-to-speech model from Alibaba's Qwen team. It uses a discrete multi-codebook LM architecture that enables end-to-end speech modeling, bypassing the information bottlenecks of traditional LM+DiT schemes. The model covers 10 major languages and features intelligent text understanding for adaptive control of tone, speaking rate, and emotional expression.

It consists of three core components:

Talker: The main LLM that processes text input and generates codec tokens.
CodePredictor: A secondary LLM that predicts residual vector quantization (RVQ) codes.
Code2Wav (tokenizer_decoder): A vocoder that converts codec tokens into audio waveforms.

Note: The supported model for this guide is Qwen3-TTS-12Hz-1.7B-CustomVoice. All models support FP16 precision.

Prerequisites

Before starting, ensure you have the following hardware and software ready for both machines.

💻 x86 Host Machine (For Model Export)

Model export, including any quantization, is compute-heavy and is designed to run on x86 with full CUDA/cuBLAS support.

OS: x86-64 Linux system (Ubuntu 22.04 or 24.04 recommended)
GPU: NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)
Software: CUDA 12.x or 13.x, Python 3.10+
Environment: Docker (recommended) or a Python virtual environment

⚡ Jetson Thor (For Engine Build & Inference)

Hardware: NVIDIA Jetson AGX Thor Developer Kit
OS: JetPack 7.1
Software: CUDA 13.x and TensorRT 10.x+ (included in JetPack)
Storage: 20–50 GB disk space for ONNX files and TensorRT engines

Part 1: Setting Up the Python Export Pipeline on x86

The Python export pipeline converts the Qwen3-TTS model components into ONNX format. This step runs on your x86 Linux workstation.

Note: For this tutorial, I'm using a workstation equipped with an NVIDIA GeForce RTX 5090.

Step 1.1: Verify Your GPU and CUDA Installation

Confirm that your GPU is properly detected and CUDA is working:

nvidia-smi

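Expected output: the table lists your GPU (an RTX 5090 here) together with the highest CUDA version the installed driver supports (13.2 here).

Check the CUDA toolkit installation (the output should show release 12.x or 13.x):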

nvcc --version

Output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0

If CUDA is not installed, download and install the CUDA Toolkit from the NVIDIA CUDA Downloads page.
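
If you want to script this check rather than read the banner by eye, the following is a minimal sketch. The helper names cuda_major and cuda_supported are made up for this example; the only thing taken from the tutorial is the "release 13.2" line format of nvcc --version and the 12.x/13.x requirement.

```shell
# Print the CUDA major version found in `nvcc --version` text read from stdin.
# (cuda_major and cuda_supported are hypothetical helper names.)
cuda_major() {
    sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p'
}

# Succeed only for the CUDA 12.x / 13.x releases this guide targets.
cuda_supported() {
    case "$1" in
        12|13) return 0 ;;
        *)     return 1 ;;
    esac
}

# Only query nvcc when it is actually on PATH.
if command -v nvcc >/dev/null 2>&1; then
    major=$(nvcc --version | cuda_major)
    if cuda_supported "$major"; then
        echo "CUDA $major.x detected: OK"
    else
        echo "CUDA major version '$major' is outside 12.x/13.x" >&2
    fi
fi
```

The same two functions can be reused unchanged on the Jetson in Step 2.1, where only 13.x is expected.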

Step 1.2: Clone the Repository

Clone the TensorRT Edge-LLM repository and initialize submodules:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

Step 1.3: Install TTS-Specific Dependencies

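The Qwen3-TTS export pipeline requires the qwen-tts package, which has dependency conflicts with the standard environment. Until Qwen3-TTS support is merged into Hugging Face Transformers, you must use a dedicated virtual environment for the export. The commands below create it with uv and must run from the repository root (skip the cd if you are still there from Step 1.2):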

cd TensorRT-Edge-LLM
uv venv .qwen3-tts --python 3.12
source .qwen3-tts/bin/activate

uv pip install qwen-tts
uv pip install .
uv pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
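
Because the export only works inside this dedicated environment, it can help to confirm that it is active before running the later steps. This is a small sketch, assuming the activate script sets the VIRTUAL_ENV variable (standard for venv-style activate scripts); in_tts_venv is a hypothetical helper name.

```shell
# Succeed if the active virtual environment directory is named .qwen3-tts.
# Assumption: `source .qwen3-tts/bin/activate` exports VIRTUAL_ENV.
in_tts_venv() {
    venv=${VIRTUAL_ENV:-}
    [ "${venv##*/}" = ".qwen3-tts" ]
}

if in_tts_venv; then
    echo "Qwen3-TTS export environment is active: $VIRTUAL_ENV"
else
    echo "Not in the export environment; run: source .qwen3-tts/bin/activate" >&2
fi
```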

Step 1.4: Download the Model

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --local-dir ./Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Step 1.5: Export the Model Components

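Qwen3-TTS has three components, and all of them must be exported: the Talker and CodePredictor with the LLM export command, and the Code2Wav vocoder with the audio export command. First, set your environment variables: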

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX_OUTPUT_DIR=$WORKSPACE_DIR/$TTS_MODEL/onnx
export TTS_CHAT_TEMPLATE=./tensorrt_edgellm/chat_templates/templates/qwen3tts.json

Export the Talker and CodePredictor (LLM components):

tensorrt-edgellm-export-llm \
    --model_dir Qwen/$TTS_MODEL \
    --output_dir $ONNX_OUTPUT_DIR \
    --chat_template $TTS_CHAT_TEMPLATE \
    --export_models talker,code_predictor

Export the Code2Wav vocoder:

tensorrt-edgellm-export-audio \
    --model_dir Qwen/$TTS_MODEL \
    --output_dir $ONNX_OUTPUT_DIR \
    --export_models tokenizer_decoder
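
Before transferring anything, you may want to confirm that all three components were written. The sketch below assumes the export produces the llm/talker, llm/code_predictor, and audio/tokenizer_decoder subdirectories that the engine build commands in Step 3.2 read from; check_onnx_layout is a hypothetical helper name, not part of TensorRT Edge-LLM.

```shell
# Report which of the three expected component directories are missing.
# Assumption: the layout matches the --onnxDir paths used in Step 3.2.
check_onnx_layout() {
    root=$1
    missing=0
    for sub in llm/talker llm/code_predictor audio/tokenizer_decoder; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as check_onnx_layout "$ONNX_OUTPUT_DIR"; a zero exit status means all three directories exist, not that their contents are valid.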

Step 1.6: Transfer the ONNX Model to Jetson Thor

Once the export is complete, transfer the output folder to your Jetson Thor device via SCP:

scp -r $ONNX_OUTPUT_DIR \
    <user>@<device>:~/tensorrt-edgellm-workspace/$TTS_MODEL/

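(Replace <user> with your Jetson username and <device> with its IP address or hostname. Create ~/tensorrt-edgellm-workspace/$TTS_MODEL on the Jetson first if it does not exist.)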
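
ONNX exports are large, so a truncated copy is worth ruling out. One way to do that, sketched here with the standard sha256sum tool from GNU coreutils, is to write a checksum manifest on the x86 host and re-check it on the Jetson. The function names and the SHA256SUMS file name are choices made for this example.

```shell
# Write a SHA-256 manifest of every file under a directory (paths relative to it).
make_manifest() {
    ( cd "$1" &&
      find . -type f ! -name SHA256SUMS -exec sha256sum {} + | sort -k 2 > SHA256SUMS )
}

# Re-check the files under a directory against its manifest.
# Prints only mismatches and returns non-zero if any file differs.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}
```

On the host, run make_manifest "$ONNX_OUTPUT_DIR" before the scp; on the Jetson, run verify_manifest on the copied onnx directory.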

Part 2: Building the C++ Runtime on Jetson Thor

Switch over to your Jetson Thor device. The C++ runtime builds TensorRT engines from ONNX files and runs inference natively without Python dependencies.

Step 2.1: Install System Dependencies & Verify

Install the required build tools and verify your JetPack 7.1 installation:

sudo apt update
sudo apt install -y cmake build-essential git

# Verify CUDA 13.x
nvcc --version

# Verify TensorRT 10.x+
dpkg -l | grep tensorrt

Step 2.2: Clone and Build

cd ~
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

mkdir build && cd build

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-thor

# Build time is approximately 1–2 minutes
make -j$(nproc)

Verify the build was successful:

./examples/llm/llm_build --help
./examples/llm/llm_inference --help
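
The two --help calls cover the LLM tools only. A rough sketch for checking every binary this guide uses later (audio_build and qwen3_tts_inference included) is shown below; the relative paths are the ones that appear in the commands of this tutorial, and check_binaries is a hypothetical helper name.

```shell
# Check that each binary used in this guide exists and is executable.
check_binaries() {
    build_dir=$1
    status=0
    for bin in examples/llm/llm_build \
               examples/llm/llm_inference \
               examples/multimodal/audio_build \
               examples/omni/qwen3_tts_inference; do
        if [ -x "$build_dir/$bin" ]; then
            echo "ok:      $bin"
        else
            echo "missing: $bin" >&2
            status=1
        fi
    done
    return "$status"
}
```

From inside the build directory, call it as check_binaries . to get a one-line status per binary.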

Part 3: Building Engines and Running TTS Inference

Step 3.1: Set Environment Variables

cd ~/TensorRT-Edge-LLM
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX=$WORKSPACE_DIR/$TTS_MODEL/onnx
export ENG=$WORKSPACE_DIR/$TTS_MODEL/engines

Step 3.2: Build the TensorRT Engines

Three engine builds are required.

Important: --maxBatchSize must be set to 1, as the Qwen3-TTS ONNX export uses a fixed batch size.

1. Talker LLM engine:

./build/examples/llm/llm_build \
    --onnxDir $ONNX/llm/talker \
    --engineDir $ENG/talker \
    --maxInputLen 4096 \
    --maxKVCacheCapacity 4096 \
    --maxBatchSize 1

2. CodePredictor LLM engine:

./build/examples/llm/llm_build \
    --onnxDir $ONNX/llm/code_predictor \
    --engineDir $ENG/code_predictor \
    --maxInputLen 4096 \
    --maxKVCacheCapacity 4096 \
    --maxBatchSize 1

3. Code2Wav engine:

./build/examples/multimodal/audio_build \
    --onnxDir $ONNX/audio/tokenizer_decoder \
    --engineDir $ENG/code2wav
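
Once the three builds finish, a quick check that each engine directory was actually populated can save a confusing failure at inference time. This is only a sketch: it tests that the talker, code_predictor, and code2wav directories used above exist and are non-empty, and engines_ready is a hypothetical helper name.

```shell
# Succeed only if every engine directory exists and contains at least one entry.
engines_ready() {
    eng_root=$1
    status=0
    for name in talker code_predictor code2wav; do
        if [ -d "$eng_root/$name" ] && [ -n "$(ls -A "$eng_root/$name")" ]; then
            echo "ok:    $name"
        else
            echo "empty: $name" >&2
            status=1
        fi
    done
    return "$status"
}
```

Call it as engines_ready "$ENG". It does not validate the engine files themselves, only that something was written.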

Step 3.3: Copy Tokenizer and Chat Template Files

cp $ONNX/llm/*.json $ENG/

Step 3.4: Create an Input File & Run Inference

Create your input JSON file. Built-in speakers include: ryan, serena, aiden, vivian, dylan, eric, uncle_fu, ono_anna, and sohee.

cat > input.json << 'EOF'
{
    "talker_temperature": 0.9,
    "talker_top_k": 50,
    "repetition_penalty": 1.05,
    "speaker": "ryan",
    "requests": [
        {
            "messages": [{"role": "assistant", "content": "Hello, how can I help you today?"}]
        },
        {
            "speaker": "serena",
            "messages": [{"role": "assistant", "content": "The weather is sunny and warm."}]
        }
    ]
}
EOF
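
In this input, the top-level "speaker" appears to act as the default, and the second request overrides it with "serena". A misspelled speaker name is an easy mistake, so here is a small sketch that checks a name against the built-in list given above. It assumes the names are case-sensitive and lowercase as listed; is_builtin_speaker is a hypothetical helper name.

```shell
# Succeed if the argument is one of the built-in speakers listed in this guide.
# Assumption: names must match exactly, in lowercase.
is_builtin_speaker() {
    case "$1" in
        ryan|serena|aiden|vivian|dylan|eric|uncle_fu|ono_anna|sohee) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the two speakers used in the example input.
for s in ryan serena; do
    if is_builtin_speaker "$s"; then
        echo "speaker ok: $s"
    else
        echo "unknown speaker: $s" >&2
    fi
done
```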

Run the inference:

./build/examples/omni/qwen3_tts_inference \
    --talkerEngineDir   $ENG/talker \
    --code2wavEngineDir $ENG/code2wav \
    --tokenizerDir      $ENG \
    --inputFile         input.json \
    --outputFile        output.json \
    --outputAudioDir    ./audio_output

The generated .wav files are written to ./audio_output, and the accompanying metadata to output.json.
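
To confirm that the run produced real audio files rather than empty or partial ones, you can inspect the file headers. The sketch below assumes the runtime writes standard RIFF/WAVE files, which begin with the bytes "RIFF" and carry "WAVE" at byte offset 8; is_wav is a hypothetical helper name.

```shell
# Succeed if a file starts with a RIFF header whose form type is WAVE.
is_wav() {
    [ "$(head -c 4 "$1")" = "RIFF" ] &&
    [ "$(dd if="$1" bs=1 skip=8 count=4 2>/dev/null)" = "WAVE" ]
}

# Report on every .wav file in the output directory, if there are any.
for f in ./audio_output/*.wav; do
    [ -e "$f" ] || continue
    if is_wav "$f"; then
        echo "valid WAV header: $f"
    else
        echo "unexpected header: $f" >&2
    fi
done
```

This checks only the container header. Listening to the files is still the real test of output quality.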

Bonus: Running Nemotron-3-Nano-4B on Jetson Thor

Deploying Nemotron-3-Nano-4B follows a similar pipeline, but requires strict adherence to the following precision and batch size rules.

CRITICAL CONSTRAINTS FOR NEMOTRON-4B:

Precision: Export MUST use BF16 (no quantization). FP8 quantization breaks Mamba SSM layers during TensorRT compilation.
Batch Size: Engine MUST be built with --maxBatchSize 2. Using batch size 1 causes garbage/degenerate output due to a known limitation.

Part 1: Export on x86 Host (BF16)

export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=NVIDIA-Nemotron-3-Nano-4B-BF16

mkdir -p $WORKSPACE_DIR/$MODEL_NAME

tensorrt-edgellm-export-llm \
    --model_dir nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --output_dir $WORKSPACE_DIR/$MODEL_NAME/onnx-bf16

Transfer these files to the Jetson device via SCP as done previously.

Part 2: Build Engine & Infer on Jetson Thor

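SSH into your Jetson and run the build command, making sure the batch size is set to 2. The ONNX, ENG, and repository paths below are from my device; adjust them to wherever you copied the ONNX files and cloned the repository: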

export ONNX=/home/jetson/Projects/TensorRT-Edge-LLM/nemotron-4b/onnx-bf16
export ENG=/home/jetson/Projects/TensorRT-Edge-LLM/nemotron-4b/engines-bf16-bs2

cd ~/Projects/TensorRT-Edge-LLM

./build/examples/llm/llm_build \
    --onnxDir   $ONNX \
    --engineDir $ENG \
    --maxBatchSize 2 \
    --maxInputLen 4096 \
    --maxKVCacheCapacity 4096

Create a simple test input:

cat > /tmp/nemotron_input.json << 'EOF'
{
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 50,
    "max_generate_length": 512,
    "requests": [
        {
            "messages": [
                {"role": "user", "content": "What is the capital of France?"}
            ]
        }
    ]
}
EOF

Run the inference:

./build/examples/llm/llm_inference \
    --engineDir $ENG \
    --inputFile /tmp/nemotron_input.json \
    --outputFile /tmp/nemotron_output.json

View the output:

cat /tmp/nemotron_output.json

You should see:

{
  "responses": [
    {
      "output_text": "The capital of France is Paris.<|im_end|>",
      "request_idx": 0
    }
  ]
}
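
Note that the sample output_text ends with the <|im_end|> marker. If you want only the answer text, a sketch like the following pulls each response out of the output file and drops that trailing marker. It assumes python3 is available on the device and that the file has the "responses" / "output_text" structure shown above; print_responses is a hypothetical helper name.

```shell
# Print each output_text from an llm_inference output file,
# without a trailing <|im_end|> marker.
# Assumption: python3 is installed and the JSON matches the sample output.
print_responses() {
    python3 - "$1" <<'PY'
import json
import sys

MARKER = "<|im_end|>"

with open(sys.argv[1]) as fh:
    data = json.load(fh)

for resp in data["responses"]:
    text = resp["output_text"]
    if text.endswith(MARKER):
        text = text[: -len(MARKER)]
    print(text)
PY
}
```

Call it as print_responses /tmp/nemotron_output.json.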

For more details and additional examples (VLM, ASR, speculative decoding), see the full TensorRT Edge-LLM documentation and the GitHub repository.

https://github.com/NVIDIA/TensorRT-Edge-LLM/releases

Nurgaliyev Shakhizat

I am a hardcore robotics and IoT enthusiast. Email: shahizat005@gmail.com
