Vision-Language Models (VLMs) are powerful AI models that can learn from both images and text, enabling them to perform tasks such as visual question answering, image captioning, and multimodal reasoning.
Most VLMs share the same general structure: a vision encoder converts images into embeddings, a projection layer maps those embeddings into the language model's embedding space, and an LLM generates text conditioned on both the visual and text tokens.
Currently, LLMs and VLMs are trained on huge clusters of NVIDIA and AMD GPUs. This raises the question: how can individuals train models using consumer-grade hardware?
Fortunately, Hugging Face recently open-sourced the nanoVLM project, a lightweight pure-PyTorch implementation designed specifically for on-device training and inference. With it, we can train a small-scale VLM (222M parameters) that combines a ViT encoder (google/siglip-base-patch16-224) with an LLM (HuggingFaceTB/SmolLM2-135M) to understand both text and visual input. Training uses data from the HuggingFaceM4/the_cauldron dataset, and checkpointing is supported, with the ability to resume from the latest checkpoint. More details: nanoVLM: The simplest repository to train your VLM in pure PyTorch.
This tutorial will guide you through training a nanoVLM model with Kubernetes, directly on the NVIDIA Jetson AGX Orin Developer Kit.
Setting up K3s on NVIDIA Jetson AGX Orin
Install K3s with NVIDIA Container Runtime support:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik" sh -
This command installs K3s, configures it to use Docker, sets the kubeconfig file permissions, and disables Traefik. The --docker flag is important, as it ensures that K3s uses Docker, which is compatible with the NVIDIA Container Runtime.
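For the GPU to be visible inside containers, Docker's default runtime should be the NVIDIA runtime. On JetPack the nvidia runtime entry is usually preconfigured in /etc/docker/daemon.json; the snippet below is a sketch of a typical Jetson configuration, so check your existing file before overwriting it:
# register the NVIDIA runtime and make it Docker's default
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
# restart Docker so the change takes effect
sudo systemctl restart docker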
Verify that the K3s cluster is running correctly:
kubectl get nodes
You should see the Jetson AGX Orin listed as a node in the cluster:
NAME     STATUS   ROLES                  AGE   VERSION
ubuntu   Ready    control-plane,master   18d   v1.32.3+k3s1
This confirms that your single-node K3s cluster is operational.
Let's also install K9s. Similar to Lens, it is a Kubernetes cluster management tool, but with a CLI user interface. We can install K9s using the command below:
curl -sS https://webinstall.dev/k9s | bash
We’re now ready to start training a model!
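Before creating the manifest, clone the nanoVLM repository. The Job below mounts the source code from a hostPath, and this walkthrough assumes it lives at /home/ubuntu/Projects/nanoVLM; adjust the path to match your setup:
git clone https://github.com/huggingface/nanoVLM.git /home/ubuntu/Projects/nanoVLM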
Create a manifest file called train.yaml and populate it with the following code:
apiVersion: batch/v1
kind: Job
metadata:
  name: nanovlm-train
  labels:
    app: nanovlm
spec:
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: nanovlm
    spec:
      restartPolicy: Never
      containers:
      - name: nanovlm
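        # Jetson (iGPU) build of the NGC PyTorch container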
        image: nvcr.io/nvidia/pytorch:24.08-py3-igpu
        command: ["bash", "-c"]
        args:
          - |
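            # install Python dependencies at container start, on top of the stock NGC image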
            pip install numpy torchvision pillow datasets huggingface-hub transformers wandb;
            echo "Starting training with WANDB...";
            export HF_HOME=/mnt/hf_cache;
            export TRANSFORMERS_CACHE=/mnt/hf_cache;
            export WANDB_CACHE_DIR=/mnt/wandb_cache;
            wandb login ${WANDB_API_KEY} && \
            python3 train.py
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: wandb-secret
              key: api-key
        volumeMounts:
        - name: code
          mountPath: /workspace/nanoVLM
        - name: hf-cache
          mountPath: /mnt/hf_cache
        - name: wandb-cache
          mountPath: /mnt/wandb_cache
        - name: dshm
          mountPath: /dev/shm
        workingDir: /workspace/nanoVLM
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
      volumes:
      - name: code
        hostPath:
          path: /home/ubuntu/Projects/nanoVLM
          type: Directory
      - name: hf-cache
        hostPath:
          path: /home/ubuntu/.cache/huggingface
          type: Directory
      - name: wandb-cache
        hostPath:
          path: /home/ubuntu/.cache/wandb
          type: Directory
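      # in-memory emptyDir backing /dev/shm; PyTorch DataLoader workers need more shared memory than Docker's 64MB default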
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
Kubernetes will retry the pod up to 4 times due to backoffLimit: 4. This is useful for controlling fault tolerance in batch jobs like training loops that might fail due to transient issues.
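The Job reads WANDB_API_KEY from a Kubernetes Secret named wandb-secret under the key api-key, so create that Secret before launching the job (replace the placeholder with your actual Weights & Biases API key):
# create the Secret referenced by the Job's secretKeyRef
kubectl create secret generic wandb-secret --from-literal=api-key=YOUR_WANDB_API_KEY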
To launch the job, run this command:
kubectl apply -f train.yaml
Sometimes a pod will not start promptly, so you will see 0/1 in the “READY” column and Pending in the “STATUS” column. If you want to see details of the status of your pod, run this command:
kubectl get pods
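If the pod stays Pending, the Events section of its description usually explains why, for example unsatisfiable CPU or memory requests:
kubectl describe pod POD_NAME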
We can also inspect the nanoVLM pod with K9s:
k9s --kubeconfig /etc/rancher/k3s/k3s.yaml
You should see the nanoVLM pod listed in the K9s interface.
Check the logs for any startup issues.
kubectl logs -f POD_NAME
Remember to replace POD_NAME with the actual name of your training pod.
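Since a full run can take many hours, you may prefer to watch the Job resource instead of following the logs; the COMPLETIONS column changes to 1/1 once training finishes:
kubectl get jobs -w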
Once training has finished, your job's status will show "Completed." Here's an example of the output:
NAME                  READY   STATUS      RESTARTS   AGE
nanovlm-train-5b8g2   0/1     Completed   0          47h
You'll also see output similar to this, indicating the training progress:
Loading from backbone weights
Successfully loaded google/siglip-base-patch16-224 weights from safetensors. Model has 85,797,120 parameters.
Successfully loaded HuggingFaceTB/SmolLM2-135M weights from safetensors. Model has 134,515,008 parameters.
The logs will also provide insights into the training progress, including model loading details, parameter counts, and performance metrics:
nanoVLM initialized with 222,081,600 parameters
Training summary: 1655332 samples, 12932 batches/epoch, batch size 128
Validation summary: 16720 samples, 130 batches/epoch, batch size 128
Epoch 5/5, Train Loss: 0.5771 | Time: 31740.05s | T/s: 4369.14
Average time per epoch: 31879.29s
Average time per sample: 0.0193s
MMStar Accuracy: 0.1013
Trends from wandb provide a visual representation of these metrics over time. If accuracy improves, the model is checkpointed and saved locally.
This particular training run lasted approximately 1 day and 20 hours, completing 5 epochs, and reached an MMStar accuracy of 0.1013. The point isn't to achieve the best model accuracy; it's to verify that the NVIDIA Jetson AGX Orin Developer Kit can handle the computational demands of a training process using a Kubernetes setup.
Upon successful training, you can upload your newly trained model to Hugging Face for sharing and future use. Use the following Python code snippet, ensuring your checkpoint path is correct:
from models.vision_language_model import VisionLanguageModel
checkpoint = "./checkpoints/nanoVLM-222M"
model = VisionLanguageModel.from_pretrained(checkpoint)
model.push_to_hub("nanoVLM-222M")
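Note that push_to_hub requires Hugging Face credentials with write access on the device. If you haven't authenticated yet, log in first:
huggingface-cli login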
Next, to perform inference with your uploaded model, execute this command, replacing shakhizat/nanoVLM-222M with the actual model ID if different:
python3 generate.py --hf_model shakhizat/nanoVLM-222M
Here is our demo image:
And here's an example of the expected output when running inference with an input image:
Using device: cuda
Loading weights from: shakhizat/nanoVLM-222M
Input:
  What is this?
Outputs:
  >> Generation 1: This picture is of outside. At the top of the picture the wall is seen. On the left
  >> Generation 2: In the foreground of this image, we can see a mannequin in grey and black color and
  >> Generation 3: This image seems to be very blurred. In the middle of the image there are two cats laying on
  >> Generation 4: At the bottom of the image, it seems like a cat in white and black fur is sitting on
  >> Generation 5: In this image we can see a cat on the ground and a tablet on the right side of the
This isn't a good result for a production-ready model. However, the primary goal of this particular exercise wasn't to achieve state-of-the-art accuracy. Instead, this training job was a feasibility check to see whether we could successfully run the nanoVLM training process on the NVIDIA Jetson AGX Orin Developer Kit. This successful proof of concept opens the door for further optimization and development of on-device VLM training.