The Object of My Detection
F-VLM enables open-vocabulary object detection without the model retraining and extensive manual dataset annotation traditionally required.
In machine learning, object detection refers to the process of identifying and localizing specific objects within an image or video. It plays a crucial role in computer vision applications as it enables machines to perceive and understand their surroundings. By accurately detecting objects, machines can extract relevant information and make informed decisions.
Object detection has applications in various domains, including autonomous vehicles, surveillance systems, medical imaging, and robotics. It allows self-driving cars to identify pedestrians and obstacles, assists in tracking objects in security cameras, aids medical professionals in diagnosing diseases, and enables robots to interact with their environment effectively.
The process of collecting and annotating datasets to train these models has been a limiting factor in their utility, however. First, a diverse range of images is necessary to ensure that the models can generalize and identify objects under various conditions. Once collected, the images must be annotated to provide ground truth labels for training the models. Annotation involves manually marking and labeling objects within the images, often with bounding boxes that precisely encompass the object's boundaries. In some cases, additional information, such as segmentation masks or key points, may be annotated to capture finer details. This annotation process requires human expertise, time, and meticulous attention to detail.
Recently, vision-language models (VLMs) trained on Internet-scale image-text pairs have emerged with the ability to perform zero-shot classification of image types that were not in their training data. While this feat had been demonstrated for classification only, a team at Google Research reasoned that such models should also encode information about object shapes and region-level categories, and that this information could be used for object detection, extending the zero-shot capabilities of VLMs to that task.
The proposed F-VLM approach uses a frozen VLM image encoder as the detector backbone, and the VLM's text encoder to compute and cache text embeddings for a detection vocabulary offline. The backbone is combined with a detector head responsible for predicting object regions for localization and for outputting detection scores, which indicate the probability that a detected box belongs to a specific category. These detection scores are computed from the cosine similarity between region features (pooled from the predicted bounding boxes) and the category text embeddings obtained by feeding category names through the pretrained VLM's text encoder.
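The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and softmax temperature are all assumptions made for clarity, and in the real system the region features would come from the frozen VLM backbone rather than random data.

```python
import numpy as np

def cosine_scores(region_features, text_embeddings, temperature=0.01):
    """Score each detected region against every category name.

    region_features: (R, D) array, one pooled feature vector per predicted box.
    text_embeddings: (C, D) array of cached category-name embeddings from the
        VLM's text encoder.
    Returns an (R, C) array of per-category probabilities.
    (Shapes and temperature are illustrative assumptions.)
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    r = region_features / np.linalg.norm(region_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = r @ t.T  # (R, C) cosine similarities

    # A temperature-scaled softmax over categories turns similarities into
    # detection scores; this is a common choice, not a detail from the paper.
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy usage: 2 regions scored against 3 categories with 4-dim embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(2, 4))
categories = rng.normal(size=(3, 4))
scores = cosine_scores(regions, categories)
print(scores.shape)  # (2, 3)
```

Because the category embeddings are computed once and cached, adding a new category at inference time only requires embedding its name, which is what makes the vocabulary "open."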
The new method was tested on a popular open-vocabulary detection benchmark suite. F-VLM was found to far outperform current state-of-the-art systems in average precision when detecting rare object categories. It was shown to correctly detect both novel and common objects without any model retraining on domain-specific datasets. And because the system relies on pretrained VLMs, training the initial model was found to be hundreds of times more cost-efficient than with existing approaches.
The researchers hope that their work will facilitate further research in novel-object detection and help the community leverage VLMs for a wider range of computer vision tasks. Toward this end, they have released their source code under a permissive license and made demos available on their project page.