Large Language Models (LLMs) require no introduction, and it's no secret that ChatGPT has revolutionized the world of AI. At the same time, demand for compact, low-cost computing keeps growing, making single-board computers increasingly popular. In this tutorial, I will show how to turn a ChatGPT-like language model into a voice-activated assistant using the NVIDIA Jetson AGX Orin 64GB Developer Kit, making it easy for anyone to replicate. Running a ChatGPT-like language model on the Nvidia Jetson board offers several advantages, including minimized network latency, enhanced privacy, and the ability to use the model in resource-constrained environments without an internet connection.
To achieve this, we need a web-based application that can run the Large Language Model, feed it the text obtained from Automatic Speech Recognition, and convert the answer provided by the LLM into a human voice using text-to-speech technology. Fortunately, Dustin Franklin from Nvidia released an awesome project called llamaspeak that does exactly this. So let's go ahead and test it!
Requirements
For simplicity, we will assume everything is installed, so take a look at the requirements before starting the tutorial. We will need the following for our project:
Hardware Required:
- NVIDIA Jetson AGX Orin 64GB Developer Kit
- Laptop or standalone computer
Additional requirements
- Some experience with Python programming is helpful but not required.
- To run the NVIDIA Jetson board headless (without a monitor), set up either SSH access or an RDP connection from your laptop.
- Familiarity with the Linux command line and a shell such as bash
- Basic knowledge of Docker and containerization
The NVIDIA Jetson AGX Orin 64GB Developer Kit includes all you need to get started immediately. Here's what I found when I unboxed my NVIDIA Jetson AGX Orin 64GB Developer Kit.
The NVIDIA Jetson AGX Orin 64GB Developer Kit, known for its powerful Ampere-based GPU and compact form factor, offers an excellent platform for running sophisticated language models.
The NVIDIA Jetson AGX Orin 64GB Developer Kit has 64 GB of eMMC 5.1 flash storage. This may be enough to start exploring the board and running deep learning applications, but you will probably end up needing more storage capacity and faster read/write speeds for more serious applications such as Large Language Models. When running the larger models, make sure you have enough disk space.
You will need to use an NVMe SSD drive, which I would recommend anyway due to the significant performance improvement it brings and the increased disk space. I have been using the Samsung 980/990 Pro NVMe SSD with the Nvidia Jetson AGX Orin Developer Kit and highly recommend it.
After installing the NVMe SSD, the next step is to check the benchmark results.
We can see the average read and write speeds of the SSD on the screen. The performance and capacity of NVMe SSDs make them well suited for LLM applications.
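If you want a quick number of your own, a simple benchmark can be run from the command line. This is just a sketch, assuming the SSD appears as /dev/nvme0n1, the hdparm package is installed, and you run the write test from a directory that lives on the SSD; adjust for your setup:
sudo hdparm -t /dev/nvme0n1   # buffered sequential read test
dd if=/dev/zero of=test.img bs=1M count=1024 oflag=direct && rm test.img   # rough sequential write test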
Prepare the NVIDIA Jetson AGX Orin board
To get started, first ensure that the NVIDIA Jetson AGX Orin is up and running. As of the time of this writing, JetPack 5.1 is the official SDK for the Jetson AGX Orin board.
Please run the following command to install all the packages:
sudo apt update
sudo apt install nvidia-jetpack
Then reboot your NVIDIA Jetson AGX Orin.
After the reboot, make sure that /etc/docker/daemon.json contains the configuration that enables the Nvidia Container Runtime:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
Restart the Docker service using the command below:
sudo service docker restart
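To confirm that the NVIDIA runtime is registered and set as the default, you can inspect Docker's configuration (the exact output format varies with the Docker version):
sudo docker info | grep -i runtime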
Enable Maximum Performance
If you want access to the full power of the NVIDIA Jetson AGX Orin, you can enable the maximum performance mode:
sudo nvpmodel -m 0
This will maximize the performance of your application at the expense of additional power consumption.
Set static maximum frequencies for the CPU, GPU, and EMC clocks:
sudo jetson_clocks
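To verify the settings took effect, you can query the active power mode and the current clock configuration (a quick sanity check; the exact mode names depend on your JetPack version):
sudo nvpmodel -q          # show the active power mode
sudo jetson_clocks --show # show the current clock settings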
Disable Desktop GUI
We can disable the desktop environment to free up RAM.
sudo systemctl set-default multi-user.target
Then restart:
sudo reboot
To enable the GUI again, run the following command:
sudo systemctl set-default graphical.target
Then restart your Jetson board:
sudo reboot
Performance monitoring tool - jtop
Jetson stats is a useful tool with a beautiful interface, developed by Raffaello Bonghi.
You can install it using the command below:
sudo -H pip install -U jetson-stats
Then run it:
jtop
Below is a screenshot of jtop.
The original Llama 2 FP16 models are too large to run on the NVIDIA Jetson AGX Orin 64GB: they require a minimum of either 2 GPUs with 80GB each, 4 GPUs with 48GB each, or 6 GPUs with 24GB each. Nevertheless, it is feasible to run Llama 2 using 4-bit quantization, and many machine learning enthusiasts are already taking this approach. GGML is a tensor library with no extra dependencies (no Torch, Transformers, or Accelerate); CUDA/C++ is all you need for GPU execution. Let's use the weights converted by TheBloke (aka Tom Jobbins).
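As a rough back-of-the-envelope check (my own estimate, not official figures): Llama 2 70B in FP16 needs about 70B × 2 bytes ≈ 140 GB just for the weights, far beyond the 64 GB of unified memory here, while a 4-bit quantized 13B model needs roughly 13B × 0.5 bytes ≈ 6.5 GB plus context and runtime overhead, which fits comfortably.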
llamaspeak is a web-based implementation of a voice assistant developed by Dustin Franklin from Nvidia. It transcribes spoken input with RIVA ASR, processes it with the LLM, and generates spoken responses via RIVA TTS, offering a conversational experience with a ChatGPT-like Large Language Model on NVIDIA Jetson boards.
The llamaspeak pipeline for a voice assistant involves several steps to convert spoken language into a meaningful response. Here's a software diagram of the pipeline:
Carefully follow the instructions here to setup llamaspeak.
Open the config.sh file and make the following modifications:
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=true
service_enabled_nmt=false
You can also modify config.sh to run only a single model.
Start the Riva server by running the command:
bash riva_start.sh
The Riva server is powered by NVIDIA TensorRT optimizations and served using the NVIDIA Triton Inference Server. Riva offers automatic speech recognition (ASR), human-like text-to-speech (TTS), and more.
You will see the following output:
Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
86837cf54d4160e1193c0b7eb41b98e32b28aaed09def4104ecbda9b0ff3aaa9
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
Use this container terminal to run applications:
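If the server seems stuck on the waiting messages, you can follow its logs directly (this assumes the default container name riva-speech used by the Riva quick-start scripts; adjust if yours differs):
docker logs -f riva-speech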
Use this terminal command to check the number of threads available:
lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Socket(s): 3
Model name: ARMv8 Processor rev 1 (v8l)
We have a total of 12 threads on the system. We won't benefit from hyper-threading in this case, so the number of threads used should be less than or equal to the number of CPU cores (in our case, up to 12 threads).
Before proceeding further, ensure the model has been downloaded from Hugging Face. Meta's LLaMA is one of the most popular open-source LLMs (Large Language Models) available today, so we will download the quantized 13B chat model of Llama 2.
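One way to fetch the quantized weights is to download them straight from TheBloke's Hugging Face repo into the models directory that run.sh mounts as /data/models/text-generation-webui inside the container (the path is relative to your jetson-containers checkout, and the repo and file names below are the ones I used; verify them against the current repo listing before downloading):
cd jetson-containers/data/models/text-generation-webui
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin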
Once the Riva server status is running, open another terminal and execute the following command:
./run.sh --workdir /opt/text-generation-webui $(./autotag text-generation-webui) \
python3 server.py --listen --verbose --api \
--model-dir=/data/models/text-generation-webui \
--model=llama-2-13b-chat.ggmlv3.q4_0.bin \
--loader=llamacpp \
--n-gpu-layers=128 \
--n_ctx=4096 \
--n_batch=4096 \
--threads=$(($(nproc) - 2))
Please note that the execution might take a few minutes.
The main parameters are:
--n_ctx: Maximum context size, i.e. the length of the context window.
--n-gpu-layers: Number of model layers to offload to the GPU (-ngl); we choose to put the entire model on the GPU.
--n_batch: Maximum number of prompt tokens to batch together when calling llama_eval.
--threads: Number of threads to use. If None, the number of threads is determined automatically; nproc returns the number of processing units (CPU cores) available on the system.
Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
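Regarding the --threads value above: $(nproc) expands to 12 on this board, so the expression reserves two cores for the rest of the system. You can check what it evaluates to before launching:
echo $(($(nproc) - 2))   # prints 10 on the 12-core AGX Orin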
You will receive output similar to:
2023-10-01 08:03:15 INFO:Loaded the model in 12.64 seconds.
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
2023-10-01 08:03:15 INFO:Loading the extension "gallery"...
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7860
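Before launching llamaspeak, it can be useful to sanity-check that the text-generation-webui API answers. A minimal request against the legacy blocking endpoint enabled by the --api flag might look like this (endpoint and field names reflect my understanding of this version of the API; the prompt is just an example):
curl -s -X POST http://0.0.0.0:5000/api/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_new_tokens": 32}'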
Open another terminal and run llamaspeak using the following command:
./run.sh --workdir=/opt/llamaspeak \
--env SSL_CERT=/data/cert.pem \
--env SSL_KEY=/data/key.pem \
$(./autotag llamaspeak) \
python3 chat.py --verbose
You should get similar output in the terminal:
* Running on all addresses (0.0.0.0)
* Running on https://127.0.0.1:8050
* Running on https://192.168.0.104:8050
Here is a demonstration video of what the result looks like.
Now, you can talk to a ChatGPT-like LLM in much the same way you interact with Google Assistant, Alexa, or Siri.
Here is the RAM utilization:
Currently, llama.cpp no longer supports GGML models; the GGML format has been superseded by GGUF. I will therefore test with Mistral-7B-v0.1-GGUF, as this model is considered very promising: in benchmarks, Mistral 7B beats Meta's Llama 2 13B model while approaching the scores of Llama's massive 70B-parameter version, despite being nearly half the size of Llama 2 13B.
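As before, the GGUF file can be pulled from TheBloke's Hugging Face repo into the same models directory (again, verify the repo and file name first; note this is the instruct variant that the command below loads):
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_0.gguf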
./run.sh --workdir /opt/text-generation-webui $(./autotag text-generation-webui) \
python3 server.py --listen --verbose --api \
--model-dir=/data/models/text-generation-webui \
--model=mistral-7b-instruct-v0.1.Q4_0.gguf \
--loader=llamacpp \
--n-gpu-layers=128 \
--n_ctx=4096 \
--n_batch=4096 \
--threads=$(($(nproc) - 2))
Here is a demonstration showcasing the Mistral 7B LLM running on the NVIDIA AGX Orin developer kit.
Below is the RAM utilization while running the Mistral 7B LLM.
Running the same prompt again often yields different responses, so reliably reproducing outputs with quantized models is challenging: they are definitely not deterministic. TheBloke's Hugging Face repos offer models for each combination of model type and parameter count, and each model type comes in multiple quantization options, allowing you to experiment further. Here are the Hugging Face repos for CodeLlama-7B-GGUF, Llama 2 13B Chat - GGUF, and Mistral-7B-v0.1-GGUF.
That's it for today! Special thanks to Dustin Franklin from Nvidia. In this post, I explored how to set up and run a ChatGPT-like large language model on the NVIDIA Jetson AGX Orin Developer Kit, enabling conversational AI capabilities locally using automatic speech recognition and text-to-speech technologies.
I hope you found this post useful and thanks for reading it. If you have any questions or feedback, leave a comment below.