Vision language models (VLMs) are a powerful new class of AI models that can understand and process both visual and textual information. This capability makes them ideal for a wide range of applications, from image captioning and visual question answering to robotic control and autonomous vehicles.
One of the latest advancements in VLMs is Microsoft's Phi-3.5-Vision model. This compact, optimized model is designed to run on edge devices with limited memory and computational resources, making it well suited for real-world applications.
This tutorial will guide you through running Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin 64GB Developer Kit using ONNX Runtime GenAI and TensorRT-LLM.
Running Phi-3.5-Vision via ONNX Runtime GenAI
Special thanks to Kunal Vaishnavi from Microsoft for his assistance in getting Phi-3.5-Vision up and running on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
ONNX Runtime is an open-source framework that allows you to run machine learning models in a variety of environments, including edge devices. To begin, we'll install ONNX Runtime and build ONNX Runtime GenAI:
Download and install the ONNX Runtime .whl file:
pip install http://108.39.248.12:81/jp6/cu126/+f/0c4/18beb3326027d/onnxruntime_gpu-1.20.0-cp310-cp310-linux_aarch64.whl
Clone ONNX Runtime GenAI and prepare the folders:
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
mkdir -p ort/include/
mkdir -p ort/lib/
Find where the .whl file is installed:
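You can locate the installed package with pip show:
pip3 show onnxruntime-gpu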
Name: onnxruntime-gpu
Version: 1.20.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: onnxruntime@microsoft.com
License: MIT License
Location: /home/jetson/.local/lib/python3.10/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:
Copy the shared libraries to ort/lib/:
cp /home/jetson/.local/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
Download the header files into the ort/include/ folder:
cd ort/include/
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_c_api.h
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_float16.h
Create a symbolic link:
ln -s /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so.1.20.0 /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so
Return to the onnxruntime-genai root directory, then build ONNX Runtime GenAI from source using the following command, specifying the appropriate CUDA version and paths:
python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.6 --config Release --ort_home ./ort --skip_tests --parallel
The build process should output the following:
[100%] Built target onnxruntime-genai
[100%] Linking CUDA shared module onnxruntime_genai.cpython-310-aarch64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 6 LTRANS jobs
Copying files to wheel directory: /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel/onnxruntime_genai
[100%] Built target python
[100%] Building wheel on /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Processing /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Preparing metadata (setup.py) ... done
Building wheels for collected packages: onnxruntime-genai-cuda
Building wheel for onnxruntime-genai-cuda (setup.py) ... done
Created wheel for onnxruntime-genai-cuda: filename=onnxruntime_genai_cuda-0.6.0.dev0-cp310-cp310-linux_aarch64.whl size=3620020 sha256=2f8cc34c586536e0090a79bf0791476483b8e272289019abfd2407c40696e417
Stored in directory: /tmp/pip-ephem-wheel-cache-zk6pno6c/wheels/81/66/19/2f6df378bb018e430d402b3aa78c90c3d8f151cdace18340e1
Successfully built onnxruntime-genai-cuda
[100%] Built target PyPackageBuild
Navigate to the directory where the .whl file is located and install it using pip:
cd /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
pip3 install *.whl
Use the following Python command to verify that the installation was successful:
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'
This should output:
<property object at 0xffff8d395120>
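If the import succeeds, the Python bindings are working. As an additional sanity check, you can also print the package version (assuming the wheel exposes __version__, which recent onnxruntime-genai builds do):
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.__version__)'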
Download the ONNX version of the Phi-3.5-Vision model:
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --local-dir ./Phi-3.5-vision-instruct-onnx
Download the example script:
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
Run inference with the following command:
python3 phi3v.py -m ./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32/ -p cuda
This command runs the Phi-3.5-Vision model and generates a caption for a given image.
Demonstrating the capabilities of Phi-3.5-Vision using ONNX Runtime GenAI.
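If you want to call the model from your own code rather than the example script, the core of phi3v.py boils down to the loop sketched below. This is a condensed illustration only: the model path and image name are placeholders, and the exact onnxruntime-genai calls (for example, how the inputs are attached and how the token loop is driven) have changed between releases, so treat phi3v.py from the repository as the reference.
import onnxruntime_genai as og

# Condensed sketch of the phi3v.py flow; paths are placeholders.
model = og.Model("./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32/")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Phi-3.5-vision chat template with a single image placeholder.
prompt = "<|user|>\n<|image_1|>\nDescribe the image<|end|>\n<|assistant|>\n"
images = og.Images.open("test3.jpg")
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=3072)

# Stream the generated caption token by token.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)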
Running Phi-3.5-Vision via TensorRT-LLM
TensorRT-LLM is another option for running VLMs on edge devices, providing a high-performance inference engine designed specifically for large language models on NVIDIA GPUs. You can easily get started with TensorRT-LLM by following the documentation on the NVIDIA Jetson AI Lab page: https://www.jetson-ai-lab.com/tensorrt_llm.html
To run Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin using TensorRT-LLM, follow these steps:
First, download the Phi-3.5-Vision model using the Hugging Face CLI. Run the following command to download the model to a local directory:
huggingface-cli download microsoft/Phi-3.5-vision-instruct --local-dir ./Phi-3.5-vision-instruct
Edit the config.json file in the downloaded model directory and change the attn_implementation parameter from flash_attention_2 to eager.
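Depending on the model revision, this key may appear as _attn_implementation in config.json; after the edit, the relevant line should look something like this:
"_attn_implementation": "eager",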
Then, convert the checkpoint using the convert_checkpoint.py script:
python3 ./TensorRT-LLM/examples/phi/convert_checkpoint.py \
--model_dir ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-convert \
--dtype float16
Next, build the TensorRT engine using the trtllm-build command:
trtllm-build \
--checkpoint_dir ./Phi-3.5-vision-instruct-convert \
--output_dir ./Phi-3.5-vision-instruct-engine \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 4096
This command builds the TensorRT engine for the language model component of Phi-3.5-Vision.
Then, build the visual engine using the build_visual_engine.py script:
python3 ./TensorRT-LLM/examples/multimodal/build_visual_engine.py \
--model_type phi-3-vision \
--model_path ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-vision_encoder
Finally, run the model using the run.py script:
python3 ./TensorRT-LLM/examples/multimodal/run.py \
--hf_model_dir ./Phi-3.5-vision-instruct \
--visual_engine_dir ./Phi-3.5-vision-instruct-vision_encoder \
--llm_engine_dir ./Phi-3.5-vision-instruct-engine \
--image_path=./test3.jpg \
--input_text "Describe the image"
This command executes the Phi-3.5-Vision model on the specified input image and generates a caption, demonstrating the model's ability to understand and describe visual content.
Example #1: Given an image of road signage, the model generates a caption describing the scene and demonstrates its OCR (Optical Character Recognition) capability.
Example #2: With an image of two retriever puppies, the model produces a caption describing the image content.
Both methods, ONNX Runtime GenAI and TensorRT-LLM, demonstrated decent inference speeds, highlighting the efficiency of running Phi-3.5-Vision on the NVIDIA Jetson AGX Orin.
This enables you to develop innovative edge computing applications that combine visual and language understanding. Whether you're working on image captioning, visual question answering, or other creative projects, Phi-3.5-Vision provides a versatile foundation for unlocking the potential of vision language models at the edge.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you liked this post, please support me by subscribing to my blog.