Vision language models (VLMs) are a powerful new class of AI models that can understand and process both visual and textual information. This capability makes them ideal for a wide range of applications, from image captioning and visual question answering to robotic control and autonomous vehicles.
One of the latest advancements in VLMs is Microsoft's Phi-3.5-Vision model. This compact, optimized model is designed to run on edge devices with limited memory and computational resources, making it well suited for real-world applications.
This tutorial will guide you through running Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin 64GB Developer Kit using ONNX Runtime GenAI and TensorRT-LLM.
Running Phi-3.5-Vision via ONNX Runtime GenAI
Special thanks to Kunal Vaishnavi from Microsoft for his assistance in getting Phi-3.5-Vision up and running on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
ONNX Runtime is an open-source framework that allows you to run machine learning models in a variety of environments, including edge devices. To begin, we'll install ONNX Runtime and build ONNX Runtime GenAI:
Download and install the ONNX Runtime .whl file:
pip install http://108.39.248.12:81/jp6/cu126/+f/0c4/18beb3326027d/onnxruntime_gpu-1.20.0-cp310-cp310-linux_aarch64.whl
Clone ONNX Runtime GenAI and prepare the folders:
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
mkdir -p ort/include/
mkdir -p ort/lib/
Find where the .whl file is installed:
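You can locate the installed package with pip show:
pip3 show onnxruntime-gpu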
Name: onnxruntime-gpu
Version: 1.20.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: onnxruntime@microsoft.com
License: MIT License
Location: /home/jetson/.local/lib/python3.10/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:
Copy the shared libraries to ort/lib/:
cp /home/jetson/.local/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
Download the header files into the ort/include/ folder:
cd ort/include/
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_c_api.h
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_float16.h
Create a symbolic link:
ln -s /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so.1.20.0 /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so
Return to the onnxruntime-genai root directory, then build ONNX Runtime GenAI from source using the following command, specifying the appropriate CUDA version and paths:
python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.6 --config Release --ort_home ./ort --skip_tests --parallel
The build process should output the following:
[100%] Built target onnxruntime-genai
[100%] Linking CUDA shared module onnxruntime_genai.cpython-310-aarch64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 6 LTRANS jobs
Copying files to wheel directory: /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel/onnxruntime_genai
[100%] Built target python
[100%] Building wheel on /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Processing /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Preparing metadata (setup.py) ... done
Building wheels for collected packages: onnxruntime-genai-cuda
Building wheel for onnxruntime-genai-cuda (setup.py) ... done
Created wheel for onnxruntime-genai-cuda: filename=onnxruntime_genai_cuda-0.6.0.dev0-cp310-cp310-linux_aarch64.whl size=3620020 sha256=2f8cc34c586536e0090a79bf0791476483b8e272289019abfd2407c40696e417
Stored in directory: /tmp/pip-ephem-wheel-cache-zk6pno6c/wheels/81/66/19/2f6df378bb018e430d402b3aa78c90c3d8f151cdace18340e1
Successfully built onnxruntime-genai-cuda
[100%] Built target PyPackageBuild
Navigate to the directory where the .whl file is located and install it using pip:
cd /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
pip3 install *.whl
Use the following Python command to verify that the installation was successful:
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'
This should output:
<property object at 0xffff8d395120>
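If the import succeeds, the Python bindings are working. As an additional sanity check, you can also print the package version (assuming the wheel exposes __version__, which recent onnxruntime-genai builds do):
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.__version__)'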
Download the ONNX version of the Phi-3.5-Vision model:
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --local-dir ./Phi-3.5-vision-instruct-onnx
Download the example script:
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
Run inference with the following command:
python3 phi3v.py -m ./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32/ -p cuda
This command runs the Phi-3.5-Vision model and generates a caption for a given image.
Demonstrating the capabilities of Phi-3.5-Vision using ONNX Runtime GenAI.
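If you want to call the model from your own code rather than the example script, the core of phi3v.py boils down to the loop sketched below. This is a condensed illustration only: the model path and image name are placeholders, and the exact onnxruntime-genai calls (for example, how the inputs are attached and how the token loop is driven) have changed between releases, so treat phi3v.py from the repository as the reference.
import onnxruntime_genai as og

# Condensed sketch of the phi3v.py flow; paths are placeholders.
model = og.Model("./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32/")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Phi-3.5-vision chat template with a single image placeholder.
prompt = "<|user|>\n<|image_1|>\nDescribe the image<|end|>\n<|assistant|>\n"
images = og.Images.open("test3.jpg")
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=3072)

# Stream the generated caption token by token.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)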
Running Phi-3.5-Vision via TensorRT-LLM
TensorRT-LLM is another option for running VLMs on edge devices, providing a high-performance inference engine designed specifically for large language models on NVIDIA GPUs. You can easily get started with TensorRT-LLM by following the documentation on the NVIDIA Jetson AI Lab page: https://www.jetson-ai-lab.com/tensorrt_llm.html
To run Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin using TensorRT-LLM, follow these steps:
First, download the Phi-3.5-Vision model using the Hugging Face CLI. Run the following command to download the model to a local directory:
huggingface-cli download microsoft/Phi-3.5-vision-instruct --local-dir ./Phi-3.5-vision-instruct
Edit the config.json file in the downloaded model directory and change the attn_implementation parameter from flash_attention_2 to eager.
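Depending on the model revision, this key may appear as _attn_implementation in config.json; after the edit, the relevant line should look something like this:
"_attn_implementation": "eager",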
Then, convert the checkpoint using the convert_checkpoint.py script:
python3 ./TensorRT-LLM/examples/phi/convert_checkpoint.py \
--model_dir ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-convert \
--dtype float16
Next, build the TensorRT engine using the trtllm-build command:
trtllm-build \
--checkpoint_dir ./Phi-3.5-vision-instruct-convert \
--output_dir ./Phi-3.5-vision-instruct-engine \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 4096
This command builds the TensorRT engine for the language model component of Phi-3.5-Vision.
Then, build the visual engine using the build_visual_engine.py script:
python3 ./TensorRT-LLM/examples/multimodal/build_visual_engine.py \
--model_type phi-3-vision \
--model_path ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-vision_encoder
Finally, run the model using the run.py script:
python3 ./TensorRT-LLM/examples/multimodal/run.py \
--hf_model_dir ./Phi-3.5-vision-instruct \
--visual_engine_dir ./Phi-3.5-vision-instruct-vision_encoder \
--llm_engine_dir ./Phi-3.5-vision-instruct-engine \
--image_path=./test3.jpg \
--input_text "Describe the image"
This command executes the Phi-3.5-Vision model on the specified input image and generates a caption, demonstrating the model's ability to understand and describe visual content.
Example #1: Given an image of road signage, the model generates a caption describing the scene and demonstrates its OCR (Optical Character Recognition) capability.
Example #2: With an image of two retriever puppies, the model produces a caption describing the image content.
Both methods, ONNX Runtime GenAI and TensorRT-LLM, demonstrated decent inference speeds, highlighting the efficiency of running Phi-3.5-Vision on the NVIDIA Jetson AGX Orin.
This enables you to develop innovative edge computing applications that combine visual and language understanding. Whether you're working on image captioning, visual question answering, or other creative projects, Phi-3.5-Vision provides a versatile foundation for unlocking the potential of vision language models at the edge.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you liked this post, please support me by subscribing to my blog.