Large Language Models (LLMs) like ChatGPT and Llama 2 have been growing in popularity due to their potential to fundamentally change how we work with data and human-computer interfaces. While these technologies generate excitement about a future that moves ever closer to the realization of Artificial General Intelligence (AGI), they are typically designed to run on powerful servers with abundant memory and computing resources.
Thanks to innovations in embedded GPU hardware, it is now possible to achieve this level of computational power on devices not much larger than a hamburger. This means LLMs can produce results locally on such devices, enabling scenarios that require real-time processing, privacy, and reduced latency.
In this article we will demonstrate how to run variants of the recently released Llama 2 LLM from Meta AI on NVIDIA Jetson hardware. What is amazing is how simple it is to get up and running. Reproducing this content is more or less as easy as spinning up a Docker container that provides all of the required software prerequisites for you. You will leave empowered with the knowledge to explore running additional models on your own to power custom LLM-based applications and services.
For this project, it is recommended to use a Jetson Orin 32GB or 64GB Developer Kit, as the models in this tutorial require 18-64 GB of RAM (depending on whether you use the 13B or 70B Llama 2 variant). We assume that you are starting from an NVIDIA Jetson Orin device that has been flashed with the latest JetPack image from NVIDIA (r35.4.1 at the time of writing).
To begin, ensure that the Docker runtime defaults to "nvidia" on your Jetson device. Do this by opening /etc/docker/daemon.json in your favorite text editor and modifying its contents as shown:
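As a reference, the resulting daemon.json typically looks like the following sketch. The "runtimes" block is normally pre-populated on a JetPack-flashed device; the "default-runtime" entry is the line being added (the runtime path may differ on your system):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
```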
Once you have saved this modification, restart the Docker service with:
sudo systemctl restart docker
We will need a directory to store our Llama model files. Create one by running the following command in a terminal:
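For example, a ~/models directory in your home folder (matching the volume mount used in the next step) can be created with:

```shell
# Create a models directory; the Docker container will mount this path
# so downloaded model files persist across container runs.
mkdir -p ~/models
```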
Next, execute the following command to start an instance of text-generation-webui in Docker:
docker run --rm -it --name textgeneration-web-ui --net=host --gpus all -v ~/models:/app/models toolboc/text-generation-webui:v1.5_r35.4.1
Note: In this example, the naming scheme refers to a Docker image containing text-generation-webui v1.5 for NVIDIA JetPack r35.4.1.
The Dockerfile used to generate this image can be found in the oobabooga/text-generation-webui repo @ text-generation-webui/docker/Dockerfile.jetson at main · oobabooga/text-generation-webui (github.com). It should be straightforward to modify this file if you wish to build the image yourself or to update to a different version of text-generation-webui.
With the Docker container successfully started, you should be able to reach the text-generation-webui application at http://<ip of jetson device>:7860. In a web browser connected to the same network as your Jetson device, navigate to that address, and you should be greeted with a user interface similar to the following:
From here, select the "Model" tab, then select the "Download custom model or LoRA" field in the lower-right and paste in "TheBloke/Llama-2-13B-chat-GPTQ".
This will begin downloading the Llama 2 chat GPTQ model variant from TheBloke/Llama-2-13B-chat-GPTQ · Hugging Face. The following steps apply to virtually any GPTQ-based Llama variant, so if you are inclined to try other models, you now know how. Once the model has downloaded, select it from the "Model" dropdown on the top-left side of the screen.
Next, we need to choose a suitable model loader. GPTQ models can be run in a variety of ways; the first option is to use the GPTQ-for-LLaMa model loader as shown below:
Using this method requires that you manually configure the wbits, groupsize, and model_type as shown in the image.
Note: These parameters can be inferred by viewing the Hugging Face model card information at TheBloke/Llama-2-13B-chat-GPTQ · Hugging Face.
While this model loader will work, we can gain ~25% in model performance (~5.2 tokens/sec vs. 4.2 tokens/sec) by instead opting to use the ExLlama_HF model loader, so let's select that instead. With this model loader, you can leave the default options as shown.
Note: By following the above instructions and instead selecting the TheBloke/Llama-2-70B-chat-GPTQ variant from TheBloke/Llama-2-70B-chat-GPTQ · Hugging Face, you can get even more intelligent responses, although the performance will top out at about 0.9 tokens/second.
We are almost ready to begin using the model.
To begin testing with a live prompt, select the "load" function while in the "Model" tab, then select the "Text generation" tab. From here, select "Instruct-Llama-v2" from the prompt drop-down menu (if you are using a different model, a different prompt choice may be better suited).
Next, you will want to increase "max_new_tokens" to 2048, as this will produce longer responses (keep in mind that values higher than 2048 will result in errors that prohibit output, as they can exceed the context length the Llama 2 models support).
Now we are ready to get results from our model. To do this, modify the area that says "Input" in the "Instruct-Llama-v2" prompt template on the left side of the screen to contain a new prompt, for example: "Are Large Language Models like LLaMa supported on embedded hardware?". Then press the "Generate" button and watch the model begin to produce output on the right side of the screen.
The text-generation-webui application has a number of options that allow you to modify its look, style, and behavior. These include, but are not limited to, a more chat-friendly UI, light/dark mode, and text-to-speech capabilities. To view available options, you can explore the "Session" tab to reveal these features and more. For an in-depth and up-to-date resource, check out the official documentation at text-generation-webui/docs at main · oobabooga/text-generation-webui (github.com).
Of particular interest is the API option, which allows you to issue programmatic prompts and consume responses in custom applications (think HomeAssistant extensions that enable interaction with your model using a microphone and speaker and your own voice).
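As a rough sketch, assuming the container was launched with the API enabled (the --api flag) and that this version of text-generation-webui exposes its blocking API on port 5000, a scripted request might look like the following. The IP address, prompt, and port are placeholders to adapt to your setup:

```shell
# Hypothetical sketch: query the text-generation-webui API from a script.
# Assumes the server was started with --api and listens on port 5000.
JETSON_IP="127.0.0.1"   # replace with the IP of your Jetson device
PAYLOAD='{"prompt": "Are LLMs supported on embedded hardware?", "max_new_tokens": 200}'
echo "$PAYLOAD"
# Uncomment to send the request once the server is running:
# curl -s "http://${JETSON_IP}:5000/api/v1/generate" \
#      -H "Content-Type: application/json" -d "$PAYLOAD"
```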
Here is an example of a number of these features in action (chat mode, light mode, and asking the model to produce use cases for LLMs on embedded hardware):
If you are interested in building an even more capable text-generation-webui container from scratch, and want to try additional loaders, check out jetson-containers/packages/llm/text-generation-webui at master · dusty-nv/jetson-containers (github.com). This approach includes support for additional loaders, including llama.cpp (which is capable of achieving 13 tokens/sec using 4-bit quantized GGML-based Llama 2 13B models).
Large Language Models can run on embedded hardware, and while acquiring a capable device today may be a bit expensive, we can expect future innovations that allow for more intelligent interactions at a lower cost. This will open the door for a number of different applications; however, I'm going to let Llama 2 explain this (since it does a rather remarkable job):
User: What are the potential use cases of LLMs (Large Language Models) when running on embedded hardware?
Llama2: Hello! I'd be happy to help answer that question. Large language models (LLMs) have been gaining popularity in recent years due to their impressive performance on a wide range of natural language processing tasks. When run on embedded hardware, LLMs have the potential to enable a variety of exciting applications, including but not limited to:
Voice assistants: Embedded LLMs can be used to create voice-controlled personal assistants for devices like smart speakers, smartphones, and other IoT devices. These assistants can understand and respond to voice commands, allowing users to control their devices hands-free.
Text-to-speech synthesis: LLMs can be used to generate high-quality speech output from text input, enabling devices like smart speakers, e-readers, and other audio-based devices to read out text content in a human-like voice.
Real-time translation: With the ability to process large amounts of data quickly and accurately, LLMs can be used for real-time translation of spoken language, enabling communication between people who speak different languages.
Sentiment analysis: LLMs can be used to analyze the sentiment of text data, enabling businesses to monitor customer feedback, sentiment, and opinions in real-time.
Predictive maintenance: By analyzing sensor data from machines and equipment, LLMs can predict when maintenance is needed, reducing downtime and increasing overall efficiency.
Smart home automation: LLMs can be used to control and automate smart homes, enabling users to control lighting, temperature, security systems, and more with voice commands or through other interfaces.
Health monitoring: LLMs can be used to analyze health data from wearables and medical devices, enabling early detection of health issues and improving patient outcomes.
Autonomous vehicles: LLMs can be used to improve the perception and understanding of autonomous vehicles, enabling them to better recognize and respond to their surroundings.
Robotics: LLMs can be used to improve the control and interaction of robots with humans, enabling them to better understand and execute tasks.
Edge AI: With the increasing amount of data being generated at the edge of the network, LLMs can be used to perform AI tasks directly on edge devices, reducing latency and improving real-time performance.
These are just a few examples of the many potential use cases of LLMs when running on embedded hardware. As the technology continues to advance, we can expect to see even more innovative applications of LLMs in the future.
Until next time, Happy Hacking!