Large Language Models (LLMs) require no introduction, and it's no secret that ChatGPT has revolutionized the world of AI. At the same time, demand for compact, low-cost computing keeps growing, making single-board computers increasingly popular. In this tutorial, I will show how to turn a ChatGPT-like language model into a voice-activated assistant using the NVIDIA Jetson AGX Orin 64GB Developer Kit, making it easy for anyone to replicate. Running a ChatGPT-like language model on the Nvidia Jetson board offers several advantages, including minimized network latency, enhanced privacy, and the ability to use the model in resource-constrained environments without an internet connection.
To achieve this, we need a web-based application that can run the Large Language Model, feed it the text obtained from Automatic Speech Recognition, and convert the answer provided by the LLM into a human voice using text-to-speech technology. Fortunately, Dustin Franklin from Nvidia released an awesome project called llamaspeak that does exactly this. So let's go ahead and test it!
Requirements
For simplicity, we will assume everything is installed, so take a look at the requirements before starting the tutorial. We will need the following for our project:
Hardware Required:
- NVIDIA Jetson AGX Orin 64GB Developer Kit
- Laptop or standalone computer
Additional requirements
- Some experience with Python programming is helpful but not required.
- To run the NVIDIA Jetson board headless (without a monitor), set up either SSH access or an RDP connection from your laptop.
- Familiarity with the Linux command line and a shell such as bash
- Basic knowledge of Docker and containerization
The NVIDIA Jetson AGX Orin 64GB Developer Kit includes all you need to get started immediately. Here's what I found when I unboxed my NVIDIA Jetson AGX Orin 64GB Developer Kit.
The NVIDIA Jetson AGX Orin 64GB Developer Kit, known for its powerful Ampere-based GPU and compact form factor, offers an excellent platform for running sophisticated language models.
The NVIDIA Jetson AGX Orin 64GB Developer Kit has 64 GB of eMMC 5.1 flash storage. This may be enough to start exploring the board and running deep learning applications, but you will probably end up needing more storage capacity and faster read/write speeds for more serious applications such as Large Language Models. When running the larger models, make sure you have enough disk space.
You will need to use an NVMe SSD drive, which I would recommend anyway due to the significant performance improvement it brings and the increased disk space. I have been using the Samsung 980/990 Pro NVMe SSD with the Nvidia Jetson AGX Orin Developer Kit and highly recommend it.
After installing the NVMe SSD, the next step is to check the benchmark results.
We can see the average read and write speeds of the SSD on the screen. The performance and capacity of NVMe SSDs make them well suited for LLM applications.
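If you want a quick number of your own, a simple benchmark can be run from the command line. This is just a sketch, assuming the SSD appears as /dev/nvme0n1, the hdparm package is installed, and you run the write test from a directory that lives on the SSD; adjust for your setup:
sudo hdparm -t /dev/nvme0n1   # buffered sequential read test
dd if=/dev/zero of=test.img bs=1M count=1024 oflag=direct && rm test.img   # rough sequential write test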
Prepare the NVIDIA Jetson AGX Orin board
To get started, first ensure that the NVIDIA Jetson AGX Orin is up and running. As of the time of this writing, JetPack 5.1 is the official SDK for the Jetson AGX Orin board.
Please run the following command to install all the packages:
sudo apt update
sudo apt install nvidia-jetpack
Then reboot your NVIDIA Jetson AGX Orin.
After the reboot, make sure that /etc/docker/daemon.json contains the configuration that enables the Nvidia Container Runtime:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
Restart the Docker service using the command below:
sudo service docker restart
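To confirm that the NVIDIA runtime is registered and set as the default, you can inspect Docker's configuration (the exact output format varies with the Docker version):
sudo docker info | grep -i runtime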
Enable Maximum Performance
If you want access to the full power of the NVIDIA Jetson AGX Orin, you can enable the maximum performance mode:
sudo nvpmodel -m 0
This will maximize the performance of your application at the expense of additional power consumption.
Set static maximum frequencies for the CPU, GPU, and EMC clocks:
sudo jetson_clocks
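To verify the settings took effect, you can query the active power mode and the current clock configuration (a quick sanity check; the exact mode names depend on your JetPack version):
sudo nvpmodel -q          # show the active power mode
sudo jetson_clocks --show # show the current clock settings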
Disable Desktop GUI
We can disable the desktop environment to free up RAM.
sudo systemctl set-default multi-user.target
Then restart:
sudo reboot
To enable the GUI again, run the following command:
sudo systemctl set-default graphical.target
Then restart your Jetson board:
sudo reboot
Performance monitoring tool - jtop
Jetson stats is a useful tool with a beautiful interface, developed by Raffaello Bonghi.
You can install it using the command below:
sudo -H pip install -U jetson-stats
Then run it:
jtop
Below is a screenshot of jtop.
The original Llama 2 FP16 models are too large to run on the NVIDIA Jetson AGX Orin 64GB: they require a minimum of either 2 GPUs with 80GB each, 4 GPUs with 48GB each, or 6 GPUs with 24GB each. Nevertheless, it is feasible to run Llama 2 using 4-bit quantization, and many machine learning enthusiasts are already taking this approach. GGML is a tensor library with no extra dependencies (no Torch, Transformers, or Accelerate); CUDA/C++ is all you need for GPU execution. Let's use the weights converted by TheBloke (aka Tom Jobbins).
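As a rough back-of-the-envelope check (my own estimate, not official figures): Llama 2 70B in FP16 needs about 70B × 2 bytes ≈ 140 GB just for the weights, far beyond the 64 GB of unified memory here, while a 4-bit quantized 13B model needs roughly 13B × 0.5 bytes ≈ 6.5 GB plus context and runtime overhead, which fits comfortably.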
llamaspeak is a web-based implementation of a voice assistant developed by Dustin Franklin from Nvidia. It transcribes spoken input with RIVA ASR, processes it with the LLM, and generates spoken responses via RIVA TTS, offering a conversational experience with a ChatGPT-like Large Language Model on NVIDIA Jetson boards.
The llamaspeak pipeline for a voice assistant involves several steps to convert spoken language into a meaningful response. Here's a software diagram of the pipeline:
Carefully follow the instructions here to setup llamaspeak.
Open the config.sh file and make the following modifications:
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=true
service_enabled_nmt=false
You can also modify config.sh to run only a single model.
Start the Riva server by running the command:
bash riva_start.sh
The Riva server is powered by NVIDIA TensorRT optimizations and served using the NVIDIA Triton Inference Server. Riva offers automatic speech recognition (ASR), human-like text-to-speech (TTS), and more.
You will see the following output:
Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
86837cf54d4160e1193c0b7eb41b98e32b28aaed09def4104ecbda9b0ff3aaa9
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
Use this container terminal to run applications:
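If the server seems stuck on the waiting messages, you can follow its logs directly (this assumes the default container name riva-speech used by the Riva quick-start scripts; adjust if yours differs):
docker logs -f riva-speech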
Use this terminal command to check the number of threads available:
lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Socket(s): 3
Model name: ARMv8 Processor rev 1 (v8l)
We have a total of 12 threads on the system. We won't benefit from hyper-threading in this case, so the number of threads used should be less than or equal to the number of CPU cores (in our case, up to 12 threads).
Before proceeding further, ensure the model has been downloaded from Hugging Face. Meta's LLaMA is one of the most popular open-source LLMs (Large Language Models) available today, so we will download the quantized 13B chat model of Llama 2.
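One way to fetch the quantized weights is to download them straight from TheBloke's Hugging Face repo into the models directory that run.sh mounts as /data/models/text-generation-webui inside the container (the path is relative to your jetson-containers checkout, and the repo and file names below are the ones I used; verify them against the current repo listing before downloading):
cd jetson-containers/data/models/text-generation-webui
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin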
Once the Riva server status is running, open another terminal and execute the following command:
./run.sh --workdir /opt/text-generation-webui $(./autotag text-generation-webui) \
python3 server.py --listen --verbose --api \
--model-dir=/data/models/text-generation-webui \
--model=llama-2-13b-chat.ggmlv3.q4_0.bin \
--loader=llamacpp \
--n-gpu-layers=128 \
--n_ctx=4096 \
--n_batch=4096 \
--threads=$(($(nproc) - 2))
Please note that the execution might take a few minutes.
The main parameters are:
--n_ctx: Maximum context size, i.e. the length of the context window.
--n-gpu-layers: Number of model layers to offload to the GPU (-ngl); we choose to put the entire model on the GPU.
--n_batch: Maximum number of prompt tokens to batch together when calling llama_eval.
--threads: Number of threads to use. If None, the number of threads is determined automatically; nproc returns the number of processing units (CPU cores) available on the system.
Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
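Regarding the --threads value above: $(nproc) expands to 12 on this board, so the expression reserves two cores for the rest of the system. You can check what it evaluates to before launching:
echo $(($(nproc) - 2))   # prints 10 on the 12-core AGX Orin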
You will receive output similar to:
2023-10-01 08:03:15 INFO:Loaded the model in 12.64 seconds.
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
2023-10-01 08:03:15 INFO:Loading the extension "gallery"...
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7860
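Before launching llamaspeak, it can be useful to sanity-check that the text-generation-webui API answers. A minimal request against the legacy blocking endpoint enabled by the --api flag might look like this (endpoint and field names reflect my understanding of this version of the API; the prompt is just an example):
curl -s -X POST http://0.0.0.0:5000/api/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, my name is", "max_new_tokens": 32}'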
Open another terminal and run llamaspeak using the following command:
./run.sh --workdir=/opt/llamaspeak \
--env SSL_CERT=/data/cert.pem \
--env SSL_KEY=/data/key.pem \
$(./autotag llamaspeak) \
python3 chat.py --verbose
You should get similar output in the terminal:
* Running on all addresses (0.0.0.0)
* Running on https://127.0.0.1:8050
* Running on https://192.168.0.104:8050
Here is a demonstration video of what the result looks like.
Now, you can talk to a ChatGPT-like LLM in much the same way you interact with Google Assistant, Alexa, or Siri.
Here is the RAM utilization:
Currently, llama.cpp no longer supports GGML models; the GGML format has been superseded by GGUF. I will therefore test with Mistral-7B-v0.1-GGUF, as this model is considered very promising: in benchmarks, Mistral 7B beats Meta's Llama 2 13B model while approaching the scores of Llama's massive 70B-parameter version, despite being nearly half the size of Llama 2 13B.
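As before, the GGUF file can be pulled from TheBloke's Hugging Face repo into the same models directory (again, verify the repo and file name first; note this is the instruct variant that the command below loads):
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_0.gguf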
./run.sh --workdir /opt/text-generation-webui $(./autotag text-generation-webui) \
python3 server.py --listen --verbose --api \
--model-dir=/data/models/text-generation-webui \
--model=mistral-7b-instruct-v0.1.Q4_0.gguf \
--loader=llamacpp \
--n-gpu-layers=128 \
--n_ctx=4096 \
--n_batch=4096 \
--threads=$(($(nproc) - 2))
Here is a demonstration showcasing the Mistral 7B LLM running on the NVIDIA AGX Orin developer kit.
Below is the RAM utilization while running the Mistral 7B LLM.
Running the same prompt again often yields different responses, so reliably reproducing outputs with quantized models is challenging: they are definitely not deterministic. TheBloke's Hugging Face repos offer models for each combination of model type and parameter count, and each model type comes in multiple quantization options, allowing you to experiment further. Here are the Hugging Face repos for CodeLlama-7B-GGUF, Llama 2 13B Chat - GGUF, and Mistral-7B-v0.1-GGUF.
That's it for today! Special thanks to Dustin Franklin from Nvidia. In this post, I explored how to set up and run a ChatGPT-like large language model on the NVIDIA Jetson AGX Orin Developer Kit, enabling conversational AI capabilities locally using automatic speech recognition and text-to-speech technologies.
I hope you found this post useful and thanks for reading it. If you have any questions or feedback, leave a comment below.