Visual question answering (VQA) is a challenging task that involves answering questions about an image. For example, given an image of a dog, a VQA system should be able to answer questions such as "what color is the dog?" or "what breed is the dog?".
The potential applications of this technology are numerous. Image search engines could use it to provide more accurate search results, while self-driving cars could use it to better understand their surroundings. It could also have applications in fields such as robotics, where robots could use visual question answering to better understand their environment and interact with humans.
Major advances in large language models have been made in recent months, giving them a remarkable ability to reason and produce human-like text. The past several years have also seen huge improvements in computer vision, yielding excellent accuracy in tasks like object recognition and depth estimation. But when it comes to combining these technologies so that machines can reason about the real world, and generalized VQA can be realized, we are still in the early stages.
Previous methods have generally relied on an end-to-end approach, in which every task, from reasoning and responding to recognizing visual objects, must be performed within the forward pass of a single neural network. Unfortunately, this approach fails to leverage the incredible progress that has been made on the individual tasks that compose the overall problem, like object detection. It means models must relearn solutions to problems that have all but been solved, and it makes them increasingly data-hungry. More data means more compute cycles, and more compute cycles means more expense and less accessibility.
Recognizing that we are not on a sustainable path towards generalized VQA models, researchers from Columbia University have devised a clever plan to simplify the process. Rather than relearning to do the things that we already know how to do, they have decided to let each type of model do what it does best, and create a framework that helps them to play together nicely.
The solution, called ViperGPT, leverages a code-generating large language model, such as GPT-3 Codex. Drawing on the knowledge about the real world contained in this model, ViperGPT writes a custom Python program for each query. That program accepts an image or video as input and, when executed, answers the question that was posed by calling existing models that were designed specifically for each subtask.
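The overall control flow can be sketched in a few lines of Python. Everything below is illustrative, not the paper's actual code: `fake_code_llm` stands in for a call to a code-generation model like Codex, and `ImagePatch.find` stands in for a pretrained object detector.

```python
def fake_code_llm(prompt: str) -> str:
    """Stand-in for a code-generation model such as Codex.

    A real system would send `prompt` to the model; here we return a
    canned program of the kind ViperGPT generates."""
    return (
        "def execute_command(image):\n"
        "    patches = image.find('dog')\n"
        "    return len(patches)\n"
    )

class ImagePatch:
    """Minimal stub of a vision API the generated code can call."""
    def __init__(self, detections):
        self._detections = detections

    def find(self, name):
        # A real implementation would run an object detector here.
        return [ImagePatch([]) for d in self._detections if d == name]

def answer(question: str, image: ImagePatch):
    # 1) Prompt the code LLM with an API spec plus the user's question.
    prompt = f"# API: ImagePatch.find(name) -> list\n# Question: {question}\n"
    # 2) Receive a Python program as text.
    program = fake_code_llm(prompt)
    # 3) Execute the generated program against the image.
    namespace = {}
    exec(program, namespace)
    return namespace["execute_command"](image)

image = ImagePatch(["dog", "dog", "cat"])
print(answer("How many dogs are in the image?", image))  # -> 2
```

The key design point is the separation of concerns: the language model only writes code against a fixed API, while the vision models behind that API do the perception.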
ViperGPT is very flexible, in that any conceivable vision or language model could be utilized in VQA tasks. Very importantly, it requires no training whatsoever. With no shiny new end-to-end architecture to train from scratch, existing building blocks that are already trained can be leveraged. It is also notable that the results produced by ViperGPT are highly interpretable: the Python code lays out the precise logic that is used in responding to queries.
As an example of the capabilities of ViperGPT, the researchers fed the model a picture of two children with some muffins on a table in front of them. They asked the model how many muffins each child could eat for it to be fair. This prompt resulted in a Python function being written that would open the image, count the number of children detected, and the number of muffins detected. It would then return the number of muffins divided by the number of children, ensuring each child gets the same amount.
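A program of the kind described above might look like the following sketch. The `ImagePatch` class is a hypothetical stand-in for the detection API the real system exposes; only the arithmetic at the end mirrors the logic the article describes.

```python
class ImagePatch:
    """Stub: a real ImagePatch would wrap a region of the input image."""
    def __init__(self, labels):
        self._labels = labels

    def find(self, name):
        # In the real system this dispatches to a pretrained object detector.
        return [label for label in self._labels if label == name]

def execute_command(image):
    children = image.find("child")
    muffins = image.find("muffin")
    # Divide evenly so each child gets the same number of muffins.
    return len(muffins) // len(children)

# Two children and eight muffins detected in the (stubbed) image.
image = ImagePatch(["child", "child"] + ["muffin"] * 8)
print(execute_command(image))  # -> 4
```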
In another demonstration, an image of several drinks was provided, along with a question asking which drink has zero alcohol in it. Again, a Python function was generated that read in an image, but this one was a little more interesting. It first locates all of the drinks in the image; then, for each one, it runs another model to determine what the drink is called. With that name in hand, it calls the language model again to ask whether the drink contains alcohol.
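The nested model calls in that example can be sketched as follows. All three models are stubbed out here, and the function names (`simple_query`, `llm_query`) are illustrative assumptions, not the system's exact API.

```python
def simple_query(patch, question):
    """Stub for a visual model that names the object in an image patch."""
    return patch["name"]

def llm_query(question):
    """Stub for a language-model lookup of world knowledge."""
    known = {"lemonade": "no", "beer": "yes", "wine": "yes"}
    drink = question.split("Does ")[1].split(" contain")[0]
    return known.get(drink, "unknown")

def execute_command(image):
    drinks = image["drinks"]  # stand-in for an object-detector call
    for patch in drinks:
        # One model names the drink...
        name = simple_query(patch, "What is this drink called?")
        # ...and the language model supplies the fact about it.
        if llm_query(f"Does {name} contain alcohol?") == "no":
            return name
    return None

image = {"drinks": [{"name": "beer"}, {"name": "lemonade"}, {"name": "wine"}]}
print(execute_command(image))  # -> lemonade
```

Note how the factual question (does lemonade contain alcohol?) is answered by the language model, not by vision: each component handles the part of the problem it is best suited for.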
ViperGPT proposes a very interesting path forward. Rather than reinvent the wheel for each new task, or try to build a single model that can do it all, this new method can let each algorithm do what it does best. And by combining them in the right way, it unlocks new types of functionality that none of the models are individually capable of.