Meta's OpenEQA Benchmark for Embodied AI Finds Current Vision and Language Models Are "Nearly Blind"
Current vision-plus-language models (VLMs) are failing to take advantage of the visual information available to them, Meta's new benchmark finds.
Facebook parent Meta has announced the release of a benchmark designed to aid the development of better vision and language models (VLMs) for physical spatial awareness in smart robots and more: OpenEQA, the Open-Vocabulary Embodied Question Answering benchmark.
"We benchmarked state-of-art vision+language models (VLMs) and found a significant gap between human-level performance and even the best models. In fact, for questions that require spatial understanding, today’s VLMs are nearly 'blind' — access to visual content provides no significant improvement over language-only models," Meta's researchers claim of their work. "We hope releasing OpenEQA will help motivate and facilitate open research into helping AI [Artificial Intelligence] agents understand and communicate about the world it sees, an essential component for artificial general intelligence.
Developed by corresponding author Aravind Rajeswaran and colleagues at Meta's Fundamental AI Research (FAIR) arm, OpenEQA aims to deliver a benchmark for measuring just how well a model can answer questions relating to visual information, and in particular its ability to build a model of its surroundings and use that information to respond to user queries. The goal: the development of "embodied AI agents," in everything from ambulatory smart home robots to wearables, that can respond usefully to prompts involving spatial awareness and visual data.
The OpenEQA benchmark puts models to work on two tasks. The first tests episodic memory: the model must search through previously recorded data for the answer to a query. The second is what Meta terms "active EQA," which sends the agent, in this case necessarily ambulatory, on a hunt through its physical environment for the information needed to answer the user's prompt, such as "where did I leave my badge?"
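To make the episodic-memory setup concrete, and the language-only comparison the researchers draw from it, the following is a minimal Python sketch rather than Meta's own code: a question is posed to a VLM alongside frames sampled from a previously recorded episode, and the same question can be posed to a text-only model as the "blind" baseline. The EQAQuestion fields and the callable signatures here are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EQAQuestion:
    question: str              # e.g. "Where did I leave my badge?"
    human_answer: str          # reference answer from human annotators
    episode_frames: List[str]  # paths to frames recorded during the episode

# Any callable that accepts a list of image paths plus a text prompt and
# returns the model's textual answer; a stand-in for a real multimodal API.
AskVLM = Callable[[List[str], str], str]

def answer_from_episodic_memory(ask_vlm: AskVLM, item: EQAQuestion,
                                max_frames: int = 8) -> str:
    """Episodic-memory EQA: the agent sees only previously recorded frames."""
    # Subsample the episode so the prompt stays within the model's context budget.
    step = max(1, len(item.episode_frames) // max_frames)
    frames = item.episode_frames[::step][:max_frames]
    return ask_vlm(frames, item.question)

def blind_baseline(ask_llm: Callable[[str], str], item: EQAQuestion) -> str:
    """Language-only baseline: the same question with no visual context,
    which is the comparison behind the "nearly blind" finding."""
    return ask_llm(item.question)
```

Passing the model in as a callable keeps the sketch independent of any particular VLM API; the published repository provides its own evaluation tooling.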
"We used OpenEQA to benchmark several state-of-art vision + language foundation models (VLMs) and found a significant gap between even the most performant models ([OpenAI's] GPT-4V at 48.5 percent) and human performance (85.9 percent)," the researchers note. "Of particular interest, for questions that require spatial understanding, even the best VLMs are nearly 'blind' — i.e., they perform not much better than text-only models, indicating that models leveraging visual information aren't substantially benefiting from it and are falling back on priors about the world captured in text to answer visual questions."
"As an example," the researchers continue, "for the question 'I'm sitting on the living room couch watching TV. Which room is directly behind me?', the models guess different rooms essentially at random without significantly benefiting from visual episodic memory that should provide an understanding of the space. This suggests that additional improvement on both perception and reasoning fronts are needed before embodied AI agents powered by such models are ready for primetime."
More information on OpenEQA, including an open-access paper detailing the work, is available on the project website; the source code and dataset have been published to GitHub under the permissive MIT license.
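For readers who want to explore the released question set, a short sketch of loading it and tallying questions by category might look like the following. The file name and field names are assumptions about the published JSON format, so check the repository's README for the actual schema.

```python
import json
from collections import Counter

# Assumed file name and field names; consult the OpenEQA repository for the real schema.
with open("open-eqa.json") as f:
    questions = json.load(f)

# Tally questions by category (e.g. spatial understanding, object recognition)
# to see where a model's errors concentrate.
by_category = Counter(q.get("category", "unknown") for q in questions)
for category, count in by_category.most_common():
    print(f"{category}: {count} questions")
```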