Vision-Language Models (VLMs) are powerful AI models that can learn from both images and text, enabling them to perform tasks such as visual question answering, image captioning, and multimodal reasoning.
The general structure of most VLMs is the same: a vision encoder converts the image into embeddings, a projection layer maps those embeddings into the language model's input space, and an LLM generates text conditioned on both modalities.
Currently, LLMs and VLMs are being trained on huge clusters of Nvidia/AMD GPUs. This raises the question: how can individuals train models using consumer-grade hardware?
Luckily, Hugging Face has recently open-sourced the nanoVLM project, which is specifically designed for on-device training and inference with a lightweight implementation in pure PyTorch. It lets us train a small-scale VLM (222M parameters) that combines a ViT encoder (google/siglip-base-patch16-224) and an LLM (HuggingFaceTB/SmolLM2-135M) to understand both text and visual input. Training uses data from the HuggingFaceM4/the_cauldron dataset. Checkpointing is supported, with the ability to resume from the latest checkpoint. More details: nanoVLM: The simplest repository to train your VLM in pure PyTorch.
This tutorial will guide you through running nanoVLM model training with Kubernetes directly on the NVIDIA Jetson AGX Orin Developer Kit.
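Before we begin, clone the nanoVLM repository onto the Jetson; the training Job defined later mounts the source tree from the host, and the path below matches the hostPath used in that manifest (adjust it if your projects live elsewhere):
# Clone nanoVLM into the directory the Kubernetes Job will mount from the host
git clone https://github.com/huggingface/nanoVLM.git /home/ubuntu/Projects/nanoVLM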
Setting up K3s on NVIDIA Jetson AGX Orin
Install K3s with NVIDIA Container Runtime support:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik" sh -
This command installs K3s, configures it to use Docker, sets the kubeconfig file permissions, and disables Traefik. The --docker flag is important, as it ensures that K3s uses Docker, which is compatible with the NVIDIA Container Runtime.
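On JetPack images the NVIDIA Container Runtime is usually preinstalled; as a quick, optional sanity check, you can confirm that Docker has it registered:
# List the container runtimes Docker knows about; "nvidia" should appear
sudo docker info | grep -i runtime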
Verify that the K3s cluster is running correctly:
kubectl get nodes
You should see the Jetson AGX Orin listed as a node in the cluster.
NAME STATUS ROLES AGE VERSION
ubuntu Ready control-plane,master 18d v1.32.3+k3s1
This confirms that your single-node K3s cluster is operational.
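As a further optional check, you can launch a throwaway pod using the same PyTorch container we will train with and confirm that CUDA is visible inside it. This assumes the NVIDIA runtime is Docker's default runtime; if it is not, the check will print False. Note that the image is several gigabytes, so the first pull takes a while.
# Run a one-off pod that reports whether PyTorch can see the Orin's GPU
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvcr.io/nvidia/pytorch:24.08-py3-igpu \
  -- python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"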
Let's also install K9s. Similar to Lens, it is a Kubernetes cluster management tool, but with a terminal-based CLI user interface. We can install K9s using the command below:
curl -sS https://webinstall.dev/k9s | bash
Running A Training Job
We're now ready to start training a model!
Create a manifest file called train.yaml and populate it with the following code:
apiVersion: batch/v1
kind: Job
metadata:
  name: nanovlm-train
  labels:
    app: nanovlm
spec:
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: nanovlm
    spec:
      restartPolicy: Never
      containers:
        - name: nanovlm
          image: nvcr.io/nvidia/pytorch:24.08-py3-igpu
          command: ["bash", "-c"]
          args:
            - |
              pip install numpy torchvision pillow datasets huggingface-hub transformers wandb;
              echo "Starting training with WANDB...";
              export HF_HOME=/mnt/hf_cache;
              export TRANSFORMERS_CACHE=/mnt/hf_cache;
              export WANDB_CACHE_DIR=/mnt/wandb_cache;
              wandb login ${WANDB_API_KEY} && \
              python3 train.py
          env:
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-secret
                  key: api-key
          volumeMounts:
            - name: code
              mountPath: /workspace/nanoVLM
            - name: hf-cache
              mountPath: /mnt/hf_cache
            - name: wandb-cache
              mountPath: /mnt/wandb_cache
            - name: dshm
              mountPath: /dev/shm
          workingDir: /workspace/nanoVLM
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
      volumes:
        - name: code
          hostPath:
            path: /home/ubuntu/Projects/nanoVLM
            type: Directory
        - name: hf-cache
          hostPath:
            path: /home/ubuntu/.cache/huggingface
            type: Directory
        - name: wandb-cache
          hostPath:
            path: /home/ubuntu/.cache/wandb
            type: Directory
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
Kubernetes will retry the pod up to 4 times due to backoffLimit: 4. This is useful for controlling fault tolerance in batch jobs like training loops that might fail due to transient issues.
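The Job reads the Weights & Biases API key from a Kubernetes Secret named wandb-secret under the key api-key, so create that Secret before launching the job (replace the placeholder with your own key):
# Create the Secret referenced by the Job's env section (name: wandb-secret, key: api-key)
kubectl create secret generic wandb-secret \
  --from-literal=api-key=<YOUR_WANDB_API_KEY>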
To launch the job, run this command:
kubectl apply -f train.yaml
Sometimes a pod will not start promptly, so you will see 0/1 in the "READY" column and Pending in the "STATUS" column. To check the status of your pods, run this command:
kubectl get pods
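If a pod stays in Pending or fails to start, kubectl describe shows its scheduling events and image-pull progress; replace the placeholder with the pod name reported by kubectl get pods:
# Show detailed status and events for the training pod
kubectl describe pod <POD_NAME>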
We can also inspect the nanoVLM pod with K9s:
k9s --kubeconfig /etc/rancher/k3s/k3s.yaml
You should see the K9s interface listing the cluster's pods, including the nanoVLM training pod.
Check the logs for any startup issues.
kubectl logs -f POD_NAME
Remember to replace POD_NAME with the actual name of your training pod.
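While the job is training, it is also worth watching utilization on the Orin itself. tegrastats ships with JetPack and reports GPU, CPU, and memory usage; the interval below is in milliseconds:
# Print GPU/CPU/RAM utilization on the Jetson every 5 seconds
sudo tegrastats --interval 5000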
Once training has finished, your job's status will show "Completed." Here's an example of the output:
NAME READY STATUS RESTARTS AGE
nanovlm-train-5b8g2 0/1 Completed 0 47h
You'll also see output similar to this, indicating the training progress:
Loading from backbone weights
Successfully loaded google/siglip-base-patch16-224 weights from safetensors. Model has 85,797,120 parameters.
Successfully loaded HuggingFaceTB/SmolLM2-135M weights from safetensors. Model has 134,515,008 parameters.
The logs will also provide insights into the training progress, including model loading details, parameter counts, and performance metrics.
nanoVLM initialized with 222,081,600 parameters
Training summary: 1655332 samples, 12932 batches/epoch, batch size 128
Validation summary: 16720 samples, 130 batches/epoch, batch size 128
Epoch 5/5, Train Loss: 0.5771 | Time: 31740.05s | T/s: 4369.14
Average time per epoch: 31879.29s
Average time per sample: 0.0193s
MMStar Accuracy: 0.1013
The wandb dashboard provides a visual representation of these metrics over time. If accuracy improves, the model is checkpointed and saved locally.
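Because the nanoVLM source tree is mounted from the host, those checkpoints land directly on the Jetson's filesystem. Assuming the default ./checkpoints output directory and the hostPath from the manifest, you can list them with:
# Checkpoints written inside the container appear in the host-mounted repo
ls -lh /home/ubuntu/Projects/nanoVLM/checkpoints/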
This particular training run lasted approximately 1 day and 20 hours, completing 5 epochs, and the MMStar accuracy achieved was 0.1013. The point here isn't to reach the best possible model accuracy; it's to verify that the NVIDIA Jetson AGX Orin Developer Kit can handle the computational demands of a training process under a Kubernetes setup.
Upon successful training, you can upload your newly trained model to Hugging Face for sharing and future use. Use the following Python code snippet, ensuring your checkpoint path is correct:
from models.vision_language_model import VisionLanguageModel

# Load the locally saved checkpoint produced by the training job
checkpoint = "./checkpoints/nanoVLM-222M"
model = VisionLanguageModel.from_pretrained(checkpoint)

# Push the trained model to your Hugging Face Hub account
model.push_to_hub("nanoVLM-222M")
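Note that push_to_hub requires an authenticated Hugging Face session on the device, so log in first if you haven't already (the CLI comes with the huggingface-hub package installed earlier):
# Authenticate with your Hugging Face account (use a write-scoped token)
huggingface-cli login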
Next, to perform inference with your uploaded model, execute this command, replacing shakhizat/nanoVLM-222M with the actual model ID if different:
python3 generate.py --hf_model shakhizat/nanoVLM-222M
Here is our demo image:
And here's an example of the expected output when running inference with an input image:
Using device: cuda
Loading weights from: shakhizat/nanoVLM-222M
Input:
What is this?
Outputs:
>> Generation 1: This picture is of outside. At the top of the picture the wall is seen. On the left
>> Generation 2: In the foreground of this image, we can see a mannequin in grey and black color and
>> Generation 3: This image seems to be very blurred. In the middle of the image there are two cats laying on
>> Generation 4: At the bottom of the image, it seems like a cat in white and black fur is sitting on
>> Generation 5: In this image we can see a cat on the ground and a tablet on the right side of the
This isn't a good result for a production-ready model. However, the primary goal of this particular exercise wasn't to achieve state-of-the-art accuracy. Instead, this training job was a crucial feasibility check to see if we could successfully run the NanoVLM model training process on the NVIDIA Jetson AGX Orin Developer Kit. This successful proof-of-concept opens the door for further optimization and development of on-device VLM training.
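As a final bit of housekeeping, once the model has been pushed and inference verified, you can delete the completed Job so it no longer lingers in the cluster:
# Remove the finished training Job and its pod
kubectl delete job nanovlm-train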