In this blog, we will delve into the innovative use of the NVIDIA Jetson TX2 to power Retrieval-Augmented Generation (RAG) for large language models (LLMs). Focusing on the core concepts and the retrieval process, we will explore how this powerful platform enhances the efficiency and performance of RAG, unlocking new possibilities in AI applications. Join me as we uncover the potential of the Jetson TX2 in transforming the landscape of LLMs.
What is RAG?
Retrieval-Augmented Generation (RAG) combines the strengths of language models with the ability to search for relevant information. Imagine asking a question. Instead of solely relying on its internal knowledge, the system can search for additional details to provide a more informed and accurate response.
This approach helps in situations where the model needs to understand a topic deeply or provide information it might not have learned directly. By using external sources like databases or the internet, RAG ensures that the information it generates is up-to-date and comprehensive, making it useful for tasks like answering complex questions or creating content that requires detailed knowledge.
What are some of the common use cases of RAG?
Retrieval-Augmented Generation (RAG) finds application in various fields where contextually rich and accurate text generation is crucial. Some common use cases include:
Question Answering Systems (Chatbots): Enhancing the ability of AI systems to provide precise answers by retrieving relevant information from a vast knowledge base.
Content Creation: Generating articles, summaries, or reports that require up-to-date and comprehensive information sourced from external databases or the internet.
The entire project is deployed on the NVIDIA Jetson TX2, which operates on ARM architecture. Let's quickly delve into the fundamentals of ARM architecture and the NVIDIA Jetson TX2.
Deployment
ARM Architecture Overview
ARM (Advanced RISC Machine) architecture is a family of reduced instruction set computing (RISC) architectures primarily used in embedded systems, mobile devices, and increasingly in high-performance computing applications. Key features include:
RISC Principles: ARM processors use RISC principles, emphasizing simplicity and efficiency in instruction sets to maximize performance per watt.
Scalability: ARM architecture is highly scalable, ranging from simple microcontrollers to powerful multicore processors used in servers and supercomputers.
Energy Efficiency: Designed for low power consumption, ARM processors are ideal for portable devices and embedded systems, offering extended battery life and thermal efficiency.
The NVIDIA Jetson TX2 is an embedded system-on-module (SoM) designed for AI and edge computing applications. It supports frameworks like TensorFlow, PyTorch, and NVIDIA TensorRT for optimized deep learning inference.
To fully utilize the GPU, it is essential to set up the necessary configurations. NVIDIA provides CUDA, a parallel computing platform and application programming interface (API), which allows developers to harness the power of the GPU for general-purpose processing (a quick sanity check is shown after the hardware overview below).
CPU: Dual-core NVIDIA Denver 2 and quad-core ARM Cortex-A57 CPUs, providing a balance of performance and power efficiency.
GPU: NVIDIA Pascal architecture with 256 CUDA cores, enabling accelerated processing for AI tasks like deep learning and computer vision.
The combination of ARM architecture and NVIDIA GPU technology in the Jetson TX2 delivers a robust platform for deploying AI at the edge, offering high performance in a compact and energy-efficient form factor.
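Before running anything heavier, it is worth confirming that PyTorch can actually see the Jetson's GPU. The snippet below is a minimal sketch, assuming a JetPack-compatible PyTorch build with CUDA support is installed:

```python
import torch

# Quick sanity check that the Jetson TX2's integrated GPU is visible to PyTorch.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")  # e.g. a Tegra X2 device name
else:
    device = torch.device("cpu")
    print("CUDA not available, falling back to CPU")
```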
Before we dive into the hands-on experiment, let's first explore a few concepts regarding the preprocessing steps that occur before retrieval.
Image and Text Embedding Using CLIP
CLIP (Contrastive Language-Image Pretraining) is a model developed by OpenAI that can understand images and text jointly by learning from large amounts of text-image pairs. Here's how embedding works with CLIP:
Image Embedding: CLIP converts images into high-dimensional vectors (embeddings) that represent their content in a continuous vector space. These embeddings capture semantic information about the images.
Text Embedding: Similarly, CLIP can encode text (such as captions or prompts) into embeddings. These embeddings capture the meaning and context of the text in relation to images. In this example, I will be embedding 8 images of a person performing various activities.
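As a rough illustration of this step, the sketch below embeds a set of images and a short text prompt using the Hugging Face implementation of CLIP. The file names and the prompt are hypothetical placeholders, not the exact ones used in the project:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model and its preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical file names for the 8 activity images.
image_paths = [f"activity_{i}.jpg" for i in range(8)]
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    # Image embeddings: one 512-dimensional vector per image.
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeddings = model.get_image_features(**image_inputs)

    # Text embeddings live in the same vector space as the image embeddings.
    text_inputs = processor(text=["a person riding a bicycle"],
                            return_tensors="pt", padding=True)
    text_embeddings = model.get_text_features(**text_inputs)
```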
Check out the paper that provides an in-depth understanding of the CLIP model architecture. [LINK]
Storing These Vectors in a Vector Database
Among the various popular vector databases available, I have chosen FAISS (Facebook AI Similarity Search) due to its seamless integration with leading machine learning frameworks such as PyTorch.
Here's how vectors from CLIP are stored and managed using FAISS:
Vector Storage: FAISS allows storing large sets of embeddings efficiently in memory or on disk, optimized for fast retrieval.
Indexing: FAISS creates an index structure (e.g., an inverted index or a graph-based index) that enables fast similarity calculations among vectors.
If I have a PDF that I want to embed and store in a vector database, the text embeddings will be stored in text_embeddings.faiss, and the image embeddings will be stored in image_embeddings.faiss. When I input a prompt, it is vectorized, and a similarity search is performed on both the image and text FAISS indices.
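A minimal sketch of this storage step is shown below. It assumes the image_embeddings tensor from the earlier CLIP snippet and L2-normalises the vectors so that an inner-product index behaves like cosine similarity:

```python
import faiss
import numpy as np

# Convert CLIP embeddings (from the snippet above) to float32 and
# L2-normalise them so that inner product equals cosine similarity.
vectors = image_embeddings.cpu().numpy().astype("float32")
faiss.normalize_L2(vectors)

# Build an exact inner-product index and persist it to disk.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
faiss.write_index(index, "image_embeddings.faiss")

# The same steps would produce text_embeddings.faiss for the text vectors.
```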
Similarity search involves finding vectors in a database that are most similar to a given query vector. Common techniques include:
Cosine Similarity: Measures the cosine of the angle between two vectors, indicating their similarity based on direction. This is the most popular technique, and it is the one used in the implementation shown here (see the short sketch after this list).
Euclidean Distance: Measures the straight-line distance between two vectors in the vector space.
Other Techniques: Some advanced techniques include learned distance metrics (e.g., learned embeddings), approximate nearest neighbor algorithms (e.g., locality-sensitive hashing), and graph-based methods (e.g., HNSW for approximate k-nearest-neighbor search).
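For reference, cosine similarity is simply the dot product of two vectors divided by the product of their norms. A tiny NumPy sketch:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: a.b / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```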
Retrieval Based on Prompt
The process starts by vectorizing the prompt. A similarity search is then performed against the vector database, identifying the most similar embeddings. Since we are only embedding images in this scenario, only the most similar images will be retrieved based on the calculated similarities.
Threshold Setting: Setting a similarity threshold to filter results based on relevance. For example, here I have set the cosine similarity threshold to 0.2. Only data with a value greater than that is fetched.
Ranking: Ranking retrieved vectors based on their similarity scores or relevance metrics. Only the top-k values are retrieved. In our example, I have set the value of k to 1, so only one image is retrieved.
This process allows efficient retrieval of relevant information or content based on user-defined criteria, leveraging the power of vector embeddings and similarity search techniques.
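Putting these pieces together, the retrieval step might look like the sketch below. It reloads the CLIP model and the FAISS index built in the earlier snippets, embeds a hypothetical prompt, and keeps only the top-k hits above the cosine-similarity threshold of 0.2:

```python
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
index = faiss.read_index("image_embeddings.faiss")

# Vectorise the prompt with CLIP's text encoder.
prompt = "a person riding a bicycle"  # hypothetical prompt
with torch.no_grad():
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    query = model.get_text_features(**text_inputs).cpu().numpy().astype("float32")
faiss.normalize_L2(query)

k = 1            # top-k results to keep
threshold = 0.2  # cosine-similarity cut-off
scores, ids = index.search(query, k)  # inner product == cosine on normalised vectors

# Keep only hits above the threshold; each result is (image index, similarity score).
results = [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if s > threshold]
print(results)
```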
Future Scope
This concludes the retrieval part of the Retrieval-Augmented Generation (RAG) process. The retrieved data will then be sent to a multimodal Large Language Model (LLM). By providing the prompt and the retrieved data as input to the multimodal LLM, we can significantly enhance its answering capabilities.
Due to the limitations of the Jetson TX2, which has only 8 GB of shared GPU memory, I will be deploying the system on a more powerful GPU setup. This will ensure the LLM can be loaded and run efficiently, taking full advantage of the enhanced retrieval process. I will demonstrate this in my next blog.
Conclusion
In this blog, we explored the state-of-the-art CLIP model for embedding both images and text. The focus was on utilizing CLIP solely for embedding purposes. We conducted a similarity search to retrieve the best-matching embeddings. The entire application runs on the NVIDIA Jetson TX2, which is built on ARM architecture. While the embedding operations take only a few seconds, the similarity search and retrieval complete in less than a second. This demonstrates the Jetson TX2's capability to handle such tasks efficiently.
In the upcoming blog, I will delve deeper into the generation phase using a Large Language Model (LLM) to complete the entire Retrieval-Augmented Generation (RAG) process.