Deploying machine learning models to production is a critical step in turning research and development efforts into practical applications. Running large language models locally has traditionally been challenging, requiring significant hardware resources and technical expertise. This tutorial will focus on deploying LLMs on the NVIDIA Jetson AGX Orin 64GB using k3s and llama.cpp, with a practical example using the Gemma 3 and Qwen 3 models.
Before diving into the deployment process, let's define the key technologies involved:
- Kubernetes (K3s): Kubernetes is an open-source container orchestration system that automates the deployment, scaling, and management of containerized applications. K3s is a lightweight, certified Kubernetes distribution built for resource-constrained environments, which makes it a good fit for edge devices like the NVIDIA Jetson AGX Orin.
- llama.cpp: This is a library for running large language models efficiently on CPUs and GPUs. It's written in C++ and optimized for performance, allowing for relatively low-resource inference. We'll leverage llama.cpp to serve our Gemma 3 and Qwen 3 models on the Jetson AGX Orin.
- Gemma 3: Gemma is a family of open-weight language models developed by Google. In this tutorial, we will deploy a specific Gemma 3 QAT model and use another Gemma model for speculative decoding.
- Speculative Decoding: A technique used to speed up LLM inference. A smaller, faster draft model generates an initial sequence of tokens, and the larger, more accurate model then verifies these tokens. This can significantly reduce the time it takes to generate text (see the sketch after this list).
- Flash Attention: An optimized attention implementation that speeds up decoding and reduces memory overhead. We build llama.cpp with Flash Attention support and enable it at runtime with --flash-attn.
- Open WebUI: A web-based user interface for interacting with LLMs. It provides a chat-like experience for users to send prompts and receive responses from the deployed model.
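To make the speculative decoding idea concrete, here is a minimal, self-contained Python sketch of the greedy propose/verify loop, using trivial stand-in "models" over integer tokens. This is purely illustrative: the real work happens inside llama-server when you pass a draft model with -md.
# Toy sketch of greedy speculative decoding over integer "tokens". The two
# functions below are stand-ins for real models.

def target_next(tokens):
    # Large, accurate model (toy): next token is simply previous + 1, mod 10.
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Small, fast draft model (toy): usually agrees, but drifts after a 4.
    nxt = (tokens[-1] + 1) % 10
    return (nxt + 1) % 10 if tokens[-1] == 4 else nxt

def speculative_decode(prompt, n_new=12, draft_len=3):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) The draft model cheaply proposes draft_len tokens.
        ctx, proposal = list(tokens), []
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) The target model verifies the proposal. (In a real engine this is
        #    a single batched forward pass, which is where the speedup comes
        #    from; here we call the toy target once per token for clarity.)
        for t in proposal:
            expected = target_next(tokens)
            if t == expected:
                tokens.append(t)         # draft token accepted "for free"
            else:
                tokens.append(expected)  # mismatch: keep the target's token
                break                    # and discard the rest of the draft
    return tokens[: len(prompt) + n_new]

print(speculative_decode([0]))  # identical output to plain greedy decoding
With greedy verification the output is exactly what the target model alone would produce; the speedup comes from verifying several draft tokens per target pass.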
The following diagram shows our setup:
The load balancer distributes incoming requests from external clients, such as Open WebUI, across multiple llama.cpp serving pod replicas deployed on the NVIDIA Jetson AGX Orin Developer Kit. The number of replicas can range from 1 to N, depending on factors such as model size, precision, and GPU memory consumption. For this setup, I will use two or three replica pods to ensure high availability of the inference engine.
Building llama.cpp on the NVIDIA Jetson AGX Orin 64GB
Building llama.cpp was fairly straightforward. First, install ccache to speed up the build process:
sudo apt-get install -y ccache
Next, clone the llama.cpp repository from GitHub. This repository contains all the necessary source code and build scripts:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Once inside the llama.cpp directory, you'll need to configure the build using CMake. The following cmake command sets up the build environment:
cmake -S . \
  -B build \
  -G Ninja \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87"
We enable -DGGML_CUDA_FA_ALL_QUANTS=ON to build Flash Attention kernels for all KV-cache quantization combinations (the KV cache is quantized to q8_0 later), and we restrict the build to the Orin's CUDA architecture with -DCMAKE_CUDA_ARCHITECTURES="87" to keep compilation time down.
After configuring the build with CMake, you can compile the project using the following command:
cmake --build build --config Release -j $(nproc)
Once llama.cpp is built, you'll need to download the desired model weights. For a good balance between model size and accuracy, I recommend using a 4-bit quantized version of the Gemma 3 model. You can download this using the huggingface_hub library in Python:
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="google/gemma-3-27b-it-qat-q4_0-gguf",
    local_dir="google/gemma-3-27b-it-qat-q4_0-gguf",
    allow_patterns=["*q4_0*"],
)
Then, also download the gemma-3-1b-it-qat-q4_0-gguf model. The gemma-3-27b-it-qat model will serve as the main model, and gemma-3-1b-it-qat will be used as the draft model for speculative decoding. You may also download the gemma-3-12b-it-qat model and increase the number of replicas in the deployment.
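For reference, the draft model can be fetched the same way. In the sketch below, the repo_id is assumed to mirror the local path passed to -md in the server command that follows:
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

# Draft model for speculative decoding; repo_id assumed from the -md path below.
snapshot_download(
    repo_id="google/gemma-3-1b-it-qat-q4_0-gguf",
    local_dir="google/gemma-3-1b-it-qat-q4_0-gguf",
    allow_patterns=["*q4_0*"],
)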
After downloading the model, start the llama.cpp server using:
build/bin/llama-server \
  -c 16384 \
  -cd 4096 \
  -m google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf \
  -md google/gemma-3-1b-it-qat-q4_0-gguf/gemma-3-1b-it-q4_0.gguf \
  -ngld 999 \
  -ngl 999 \
  --draft-max 3 \
  --draft-min 3 \
  -t 24 \
  --flash-attn \
  -ctv q8_0 \
  -ctk q8_0 \
  --port 10000
Upon running this command, llama-server starts and serves a built-in web interface. Open http://localhost:10000 in a browser on the Jetson to interact with the Gemma 3 QAT model through this interface.
You can then enter a prompt in the interface to test the token generation speed and interact with the loaded Gemma 3 model.
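If you prefer to script your tests instead of using the built-in UI, llama-server also exposes an OpenAI-compatible HTTP API. Here is a minimal sketch using Python's requests library, assuming the server above is listening on localhost:10000:
import requests

# Minimal sketch: send one chat request to llama-server's OpenAI-compatible API.
resp = requests.post(
    "http://localhost:10000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Explain speculative decoding in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])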
Running the llama.cpp Server with the Qwen3-30B-A3B Model
Qwen recently released eight new models as part of its latest family, Qwen3. The Qwen3-30B-A3B variant is engineered for both in-depth reasoning (thinking mode) and rapid response generation (non-thinking mode). It also boasts impressive multilingual capabilities, supporting over 100 languages.
To run the llama-server with the Qwen3-30B-A3B GGUF model, you first need to download the corresponding model file. You can do this similarly to the Gemma model, adjusting the repo_id and allow_patterns:
import os
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/Qwen3-30B-A3B-GGUF",
    local_dir="models/Qwen3-30B-A3B-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
Once downloaded, you can start the server using the following command:
build/bin/llama-server \
--model models/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf \
--threads 24 \
--ctx-size 16384 \
--n-gpu-layers 999 \
--flash-attn \
-ctv q8_0 \
-ctk q8_0 \
--port 10000
Here's an example demonstrating the thinking mode:
Here is an example of the non-thinking mode:
Prompt: Write a Python function to add two numbers /no_think.
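The same comparison can be scripted against the server's OpenAI-compatible endpoint; appending /no_think to the prompt is what switches Qwen3 into the faster non-thinking mode. A small sketch, assuming the Qwen3 server above is running on localhost:10000:
import requests

def ask(prompt):
    # Sketch: one chat request to the Qwen3 llama-server on localhost:10000.
    resp = requests.post(
        "http://localhost:10000/v1/chat/completions",
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 1024},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Thinking mode (default) vs. non-thinking mode (/no_think appended).
print(ask("Write a Python function to add two numbers"))
print(ask("Write a Python function to add two numbers /no_think"))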
The average token generation speed observed with this setup is consistently 27 tokens per second. By following these detailed steps, you should be able to successfully build llama.cpp and run large language models like Gemma 3 and Qwen3 on your NVIDIA Jetson AGX Orin 64GB.
Setting up K3s on the NVIDIA Jetson AGX Orin
Next, we need to turn the NVIDIA Jetson AGX Orin 64GB into a high-availability inference engine using Kubernetes. We'll use k3s, the lightweight Kubernetes distribution.
Since we're working with a Nvidia Jetson AGX Orin, it's crucial to ensure that K3s is set up to leverage the device's GPU. K3s can utilize the NVIDIA Container Runtime, which is part of the JetPack SDK.
First, install K3s with the NVIDIA Container Runtime support:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik" sh -
This command installs K3s, configures it to use Docker, sets the kubeconfig file permissions, and disables Traefik. The --docker flag is important, as it ensures that K3s uses Docker, which is compatible with the NVIDIA Container Runtime.
Verify that the K3s cluster is running correctly:
kubectl get nodes
You should see the Jetson AGX Orin listed as a node in the cluster:
NAME     STATUS   ROLES                  AGE   VERSION
ubuntu   Ready    control-plane,master   18d   v1.32.3+k3s1
This confirms that your single-node K3s cluster is operational.
Next, we need to create a container image for llama.cpp that is optimized for the NVIDIA Jetson AGX Orin's architecture. We'll use the jetson-containers project, which simplifies this process.
LSB_RELEASE=24.04 CUDA_VERSION=12.8 jetson-containers build llama_cpp
This process will build a Docker image that includes the necessary libraries and configurations to run llama.cpp with GPU support. The resulting image will be tagged appropriately (e.g., cu12.8:r36.4.3-cu128-24.04).
You can also use ready-to-go container images available on Docker Hub, such as the llama.cpp image by Dustin Franklin.
Now, let's create the Kubernetes deployment and service definitions to run the llama.cpp server with the Gemma models. Save the following YAML definitions into two separate files: llama-cpp-deployment.yaml and openwebui-deployment.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-deploy
  labels:
    app: llama-cpp-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-cpp-app
  template:
    metadata:
      labels:
        app: llama-cpp-app
    spec:
      containers:
        - name: llama-cpp-container
          image: cu128:r36.4.3-cu128-24.04
          command:
            [
              "/bin/bash",
              "-c",
              "llama-server --batch-size 2048 --ubatch-size 512 -c 16384 -cd 4096 -m /model/google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf -md /model/google/gemma-3-1b-it-qat-q4_0-gguf/gemma-3-1b-it-q4_0.gguf -ngld 999 -ngl 999 --draft-max 3 --draft-min 3 -t 24 -fa -ctv q8_0 -ctk q8_0 --port 10000 --host 0.0.0.0 --metrics --parallel 1 --no-webui"
            ]
          ports:
            - containerPort: 10000
          volumeMounts:
            - name: model-storage
              mountPath: /model
      volumes:
        - name: model-storage
          hostPath:
            path: /home/ubuntu/Projects/llama.cpp
            type: Directory
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpp-service
  labels:
    app: llama-cpp-app  
spec:
  type: LoadBalancer
  selector:
    app: llama-cpp-app
  ports:
    - name: http           
      protocol: TCP
      port: 10000
      targetPort: 10000
The number of replicas is set to 2, as we're deploying on a single NVIDIA Jetson AGX Orin. You might increase this if you have multiple Jetson devices in a cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: open-webui
  name: open-webui-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "500Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          env:
            - name: OPENAI_API_BASE_URL
              value: "http://<Jetson_AGX_Orin_IP>:31000/v1" 
          tty: true
          volumeMounts:
            - name: webui-volume
              mountPath: /app/backend/data
      volumes:
        - name: webui-volume
          persistentVolumeClaim:
            claimName: open-webui-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui-service
spec:
  type: NodePort # Expose OpenWebUI
  selector:
    app: open-webui
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 31001 # Choose a different port
Remember to adjust file paths and parameters according to your specific setup and the models you are using.
Apply these YAML files to your K3s cluster on the Nvidia Jetson AGX Orin:
kubectl apply -f llama-cpp-deployment.yaml
kubectl apply -f openwebui-deployment.yaml
Check the pods:
kubectl get pods
The pods have been successfully created and are running:
NAME                                     READY   STATUS    RESTARTS   AGE
llama-cpp-deploy-7d7ddd8448-527m8        1/1     Running   0          40s
llama-cpp-deploy-7d7ddd8448-v7jtr        1/1     Running   0          40s
llama-cpp-deploy-7d7ddd8448-xrxtj        1/1     Running   0          40s
open-webui-deployment-5d9f8dc9bf-txnw6   1/1     Running   0          35s
Check the services:
kubectl get svc
The output will list the services running in your cluster, including the ones you just created:
NAME                TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)
kubernetes          ClusterIP      10.43.0.1      <none>        443/TCP
llama-cpp-service   LoadBalancer   10.43.227.86   <YOUR_IP>     10000:31943/TCP
openwebui-service   LoadBalancer   10.43.40.30    <YOUR_IP>     8080:30119/TCP
In k3s, LoadBalancer services are handled by klipper-lb, which assigns the host machine's IP. Open a web browser on a machine that can access your Jetson AGX Orin's network and navigate to:
http://<YOUR_IP>:8080
Replace <YOUR_IP> with the actual IP address of your NVIDIA Jetson AGX Orin. You should now see the Open WebUI interface, which is connected to the llama.cpp server running the Gemma models.
Qwen3-30B-A3B is a Mixture-of-Experts model with roughly 30 billion total parameters, of which only about 3 billion are active per token. It also works quite well and achieves reasonable speed (20-30 tokens/s) on the NVIDIA Jetson AGX Orin.
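To check the throughput you are actually getting from the cluster, you can time a single request and divide the number of generated tokens by the elapsed wall-clock time. A rough sketch, assuming the llama-cpp-service is reachable on the Jetson's IP at port 10000 and that the OpenAI-compatible response includes a usage field:
import time
import requests

JETSON_IP = "<YOUR_IP>"  # replace with your Jetson AGX Orin's IP address

start = time.time()
resp = requests.post(
    f"http://{JETSON_IP}:10000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a short poem about edge AI."}],
        "max_tokens": 256,
    },
    timeout=600,
)
resp.raise_for_status()
elapsed = time.time() - start

# Rough tokens/s: generated tokens divided by total request time (this includes
# prompt processing, so it slightly underestimates pure generation speed).
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")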
The NVIDIA Jetson AGX Orin has 64GB of memory, which is a significant resource. However, LLMs can be memory-hungry. Carefully monitor memory usage with tools like jetson-stats and consider techniques like quantization to reduce the memory footprint of your models.
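jetson-stats also ships a small Python API (jtop) alongside its CLI. A minimal sketch that samples system stats a few times while the pods are serving, assuming jetson-stats is installed (pip install jetson-stats) and its service is running:
import time
from jtop import jtop  # provided by the jetson-stats package

# Sketch: sample system readings a few times while the llama.cpp pods serve.
# jetson.stats is a dict of current readings (RAM, GPU load, temperatures, ...).
with jtop() as jetson:
    for _ in range(5):
        if not jetson.ok():
            break
        print(jetson.stats)
        time.sleep(2)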
The llama-server command in the deployment YAML includes parameters like -ngl and -ngld, which control how many layers of the main model and the draft model are offloaded to the GPU. Adjust these for optimal performance.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you like this post, please support me by subscribing to my blog.