SAM: Segment Anything Model — Versatile by Itself and Faster with OpenVINO
Learn more about the Segment Anything Model (SAM) and how you can accelerate it with Intel's OpenVINO in your projects.
Original Article written by Dr. Paula Ramos — AI Evangelist at Intel (LinkedIn Profile)
SAM, also known as the Segment Anything Model, is gaining popularity due to its excellent performance and versatility in computer vision segmentation tasks. This article provides an overview of SAM, including its fundamentals, how to execute it, and how to optimize its performance using OpenVINO. Since its launch on April 5, 2023, we have seen the computer vision community take advantage of it to create segmentations in medical imaging and satellite imaging, and to improve 3D rendering techniques.
Before SAM — What is image segmentation?
But what is image segmentation? The technique has been around for a while. Unlike traditional image classification, which predicts information from the complete image, segmentation classifies individual pixels and clusters them into a segmentation mask that identifies the different elements in the image. With the pixels identified and localized, you can remove the background, or use inpainting to erase those objects from the original image. In image generation, we can see how valuable segmentation maps are in projects such as GauGAN, where a realistic image is produced from a segmentation map.
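To make the idea concrete, here is a minimal NumPy sketch of what a segmentation mask is and how it enables background removal. The label map and image values are made up for illustration; real masks come from a model:

```python
import numpy as np

# Hypothetical per-pixel label map: 0 = background, 1 = person, 2 = car.
labels = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 0, 0],
])

# A tiny grayscale "image" of the same size.
image = np.arange(16, dtype=np.float32).reshape(4, 4)

# Binary segmentation mask for the "person" class (label 1).
person_mask = labels == 1

# Background removal: keep person pixels, zero out everything else.
person_only = np.where(person_mask, image, 0.0)

print(person_mask.sum())   # number of "person" pixels
```

The same masking pattern underlies inpainting pipelines: the binary mask tells the editing step which pixels to touch and which to leave alone.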
Meta offers us an open-source algorithm, model, and dataset called SAM (Segment Anything) that classifies pixels and clusters them to create a segmentation mask identifying the different elements in the image. But the big difference from other segmentation algorithms is that we do not have to run supervised training with assigned labels.
What is SAM, and how does it work?
SAM is an algorithm for image segmentation that can identify and segment any object in an image without the need for prior training or assigned labels.
SAM is a real game-changer when it comes to segmentation techniques. It is a single model capable of both interactive and automatic segmentation. It is designed to produce top-quality object masks based on input prompts like points or boxes. It’s so versatile that it can even generate masks for all objects in an image. The Segment Anything Model has been trained on a massive dataset of 11 million images and 1.1 billion masks. With that kind of training, you can bet that it has some serious skills for zero-shot performance on various segmentation tasks.
This model comprises three parts, each working together to create a seamless and highly effective segmentation process. The Segment Anything Model is truly a marvel of modern technology!
- Image Encoder — A Vision Transformer (ViT) model pre-trained with the Masked Autoencoders (MAE) approach that encodes the image into an embedding space. The image encoder runs once per image and can be applied prior to prompting the model.
- Prompt Encoder — Encodes the segmentation condition. The condition can be: 1) points — a set of points related to the object to be segmented; the prompt encoder converts them to embeddings using positional encoding. 2) boxes — bounding boxes where the object to segment is located; like points, the box coordinates are encoded via positional encoding. 3) a segmentation mask — provided by the user, embedded using convolutions and summed element-wise with the image embedding. 4) text — encoded by the CLIP model's text representation (still a proof of concept).
- Mask Decoder — The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask.
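The positional encoding the prompt encoder applies to point prompts can be sketched with random Fourier features. The function name, dimensions, and coordinates below are illustrative, not SAM's actual code:

```python
import numpy as np

def encode_points(points, embed_dim=8, seed=0):
    """Map 2-D point prompts to embeddings with random Fourier features,
    the style of positional encoding used for point and box prompts.
    `points` is an (N, 2) array of (x, y) coordinates normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    # Random Gaussian projection matrix, fixed once at "initialization".
    B = rng.normal(size=(2, embed_dim // 2))
    projected = 2.0 * np.pi * points @ B               # (N, embed_dim // 2)
    return np.concatenate([np.sin(projected), np.cos(projected)], axis=-1)

# Two illustrative click prompts in normalized image coordinates.
clicks = np.array([[0.25, 0.40], [0.80, 0.10]])
embeddings = encode_points(clicks)
print(embeddings.shape)  # (2, 8)
```

Because the projection matrix is fixed, nearby clicks map to similar embeddings, which is what lets the mask decoder relate a prompt's location to positions in the image embedding.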
Source: the SAM paper.
This embedding effectively allows the model to create high-quality masks from a prompt. The model provides numerous masks corresponding to the prompt, each with its own score. As illustrated in the diagram, the masks may have overlapping regions. This is advantageous for ambiguous scenarios where the prompt could be interpreted in various ways, such as segmenting an entire object or just a specific portion, or when the prompt lands on the intersection of multiple objects. The model's promptable interface makes it versatile: a diverse range of segmentation tasks can be accomplished by designing appropriate prompts, including clicks, boxes, text, and other inputs.
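Since the decoder returns several candidate masks with quality scores, a typical pattern is to keep the highest-scoring one (or keep all of them for ambiguous prompts). A minimal sketch with made-up masks and scores:

```python
import numpy as np

# Three hypothetical candidate masks for one click prompt (whole object,
# a part, a subpart), each with a predicted quality score.
masks = np.stack([
    np.array([[1, 1], [1, 1]]),   # whole object
    np.array([[1, 1], [0, 0]]),   # part of the object
    np.array([[1, 0], [0, 0]]),   # subpart
]).astype(bool)
scores = np.array([0.93, 0.71, 0.55])

# Keep the mask with the highest predicted score.
best = masks[np.argmax(scores)]
print(best.sum())  # 4 pixels selected

# The candidates overlap: every "part" pixel is also a "whole" pixel.
overlap = masks[0] & masks[1]
print(overlap.sum())  # 2
```

In an interactive tool you might instead show all three candidates and let the user pick, which is exactly what the ambiguity-aware design enables.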
Why is SAM important for developers?
SAM is a foundation model trained on a large dataset of 1.1 billion masks over 11 million images. It can generalize the notion of an object, allowing it to deploy its skills in contexts it never saw during training. SAM's ability to segment an object it has never seen before is known as zero-shot learning, a key capability of modern deep learning models.
SAM is versatile and can be used with different types of prompts, including mark points, bounding boxes, and masks. In addition, SAM’s experimental feature of indicating a description, such as “I want the cartwheel to be segmented,” could serve as a bridge to larger language models like GPT-4, allowing artificial intelligence to define which elements of an image to segment.
SAM is open-source and offers an exponential leap in the use of image segmentation tools, with the ability to identify more than 500 masks in an image. It can replace manual labeling and speed up annotation, or be used alongside a labeling tool. SAM's zero-shot capability also makes it helpful for bootstrapping a dataset you plan to fine-tune on.
As deep learning continues to scale, SAM automates the understanding of visual data, making it useful for virtual reality hardware and augmented reality applications. That is the main reason Meta is doing this: to be one step closer to understanding human-object interaction through first-person video.
Make SAM faster with OpenVINO
I invite you to visit our OpenVINO Notebooks repository, where you can find many demos, tutorials, and examples of the most popular AI solutions. One of them is the SAM notebook, where you can run the model on your CPU or on an integrated/discrete GPU.
Here you will find the repository: https://github.com/openvinotoolkit/openvino_notebooks
Here you will find the SAM Notebook: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/237-segment-anything (Credits Ekaterina Aidova)
Here is one of my results:
You can run this model on a CPU or even a GPU. Your next question might be: how fast is it on the CPU?
Let me explain. In the notebook, we use the smaller version of the model, which has the three elements described above: image encoder, prompt encoder, and mask decoder. The last two run together as the mask prediction step, which took about 30 ms on my setup. The image encoder does not need to run every time, only when there is a new image; it took about 130 ms on my setup. I achieved those results on my 12th Gen Intel Core i7, which is faster than the time reported in the original paper, where they mention roughly 50 ms for the decoding part.
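The practical consequence of this split is that the encoder cost is amortized across prompts: you pay ~130 ms once per image, then ~30 ms per click. A small sketch of that arithmetic (the millisecond figures are the ones I measured above, not guarantees for your hardware):

```python
ENCODER_MS = 130.0  # runs once per image (measured on my 12th Gen Core i7)
DECODER_MS = 30.0   # prompt encoder + mask decoder, runs once per prompt

def latency_ms(num_prompts):
    """Total time to segment one image with `num_prompts` interactive clicks."""
    return ENCODER_MS + num_prompts * DECODER_MS

# One click costs 160 ms total; ten clicks average only 43 ms per mask,
# because the expensive image encoding is computed once and reused.
print(latency_ms(1))        # 160.0
print(latency_ms(10) / 10)  # 43.0
```

This is why the notebook caches the image embedding and re-runs only the mask prediction as you move the prompt around.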
Where is SAM used in the industry?
Medical imaging: https://ai.papers.bar/paper/694d9f7ee31211edb95839eec3084ddd
NeRF improvement: https://ai.papers.bar/paper/45670144e30e11edb95839eec3084ddd
I hope this blog is helpful for those who want to learn, play, and innovate with SAM and optimize processes with OpenVINO. Feel free to post the exciting things you segment; this is new, and we are learning together. Take it to the real-world solution level and optimize inference for your model with OpenVINO. There is so much you can create: bring your use cases back and inspire others to build solutions.
The benefits of AI are within everyone's reach and can help us improve our lives. Taking an active role in the development of AI will help ensure that it benefits society.
My invitation to you is to try the OpenVINO notebooks. Stay tuned because more related content will be coming to my channels. If you have any issues, please add your questions in the discussion section of our repository https://github.com/openvinotoolkit/openvino_notebooks/discussions
Enjoy the blog and enjoy the notebooks! 😊
#iamintel #openvino #generativeai #segmentanything
Hi, all! My name is Paula Ramos. I have been an AI enthusiast and have worked with Computer Vision since the early 2000s. Developing novel integrated engineering technologies is my passion. I love to deploy solutions that real people can use to solve their equally real problems. If you’d like to share your ideas on how we could improve our community content, drop me a line! 😉 I will be happy to hear your feedback.
Here is my LinkedIn profile: https://www.linkedin.com/in/paula-ramos-41097319/
Notices & Disclaimers: Intel technologies may require enabled hardware, software, or service activation. No product or component can be absolutely secure. Your costs and results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.