Showing Robots What Matters Most

Clio leverages advances in computer vision and LLMs to help robots focus on the objects around them that matter most for the task at hand.

Nick Bild
Clio helps robots to see the world through an appropriate lens (📷: D. Maggio et al.)

The world is filled with endless possibilities. This is one of the biggest reasons why robots designed to perform a narrowly focused task are often very effective, while robots designed for more general-purpose roles struggle mightily. Consider for a moment a robotic perception system operating in the kitchen of your home. There are a dozen packets of sauce clustered together on the table. Should the robot perceive them as a single pile of packets, or should it recognize each one individually?

The answer is an emphatic “it depends.” If the robot is tasked with cleaning all of the packets off of the table, then it is simpler and more efficient to detect them all as a group to be swept away. But if the robot needs to put the sauce on a plate of food, then an individual packet must be identified before it is picked up. It is clear that the way a robot views the world needs to be shaped by what it is trying to accomplish. Yet, given the vast array of possibilities, it is completely impractical to manually program its perception system to view the world through every possible lens.

A group of engineers at MIT is working to tackle this problem and bring us one step closer to a world in which robots can jump from one task to another as easily as we can. They have developed a novel framework called Clio that helps robots to focus only on objects that matter in a given context. It does this by quickly mapping a three-dimensional scene and identifying only the objects — at an appropriate level of granularity — that are important for completing a specific task.

The team built upon recent work in the area of open-set object recognition. These deep learning algorithms are trained on billions of images and their associated textual captions, which teaches them to identify segments of images that correspond to a wide range of objects, not just the relative handful of classes that earlier algorithms could handle. Moreover, they learn to recognize objects at different levels of granularity, as in the sauce packet example mentioned earlier.
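To make that idea concrete, here is a minimal sketch of open-set recognition using a publicly available vision-language model (CLIP, via the Hugging Face transformers library). This is not Clio's own pipeline; the candidate labels, image file, and model choice are illustrative assumptions.

```python
# Minimal open-set recognition sketch with a CLIP-style vision-language model.
# Labels, image path, and model are illustrative assumptions, not Clio's pipeline.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any free-form text can serve as a candidate label -- there is no fixed class list.
labels = [
    "a pile of sauce packets",
    "a single sauce packet",
    "a dinner plate",
    "a kitchen table",
]
image = Image.open("kitchen_scene.jpg")  # hypothetical photo of the scene

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the image and each text prompt, normalized to probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```

Because the labels are arbitrary text, the same model can score both the coarse description ("a pile of sauce packets") and the fine-grained one ("a single sauce packet") without retraining.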

With this technology in hand, the next question was how to shape the robot's perception for a given task. The team's approach combines cutting-edge computer vision models with large language models. The large language model processes natural language instructions and helps the robot understand what needs to be done. Mapping tools then break the visual scene down into small segments, which are analyzed for semantic similarity to the task. The “information bottleneck” principle is then applied to compress the visual data by filtering out irrelevant segments and retaining only those most pertinent to the task. This combination allows Clio to tune its focus to the right level of granularity, isolating and identifying the objects essential to completing the task while disregarding unnecessary details.
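As a rough illustration of that filtering step (a simplified stand-in for the information bottleneck, not Clio's actual algorithm), the sketch below scores candidate scene segments against a task description using CLIP embeddings and keeps only the most relevant ones. The task string, segment crop files, and threshold are assumptions made for the example.

```python
# Rough sketch of task-driven pruning of scene segments. This is a simplified
# stand-in for the information-bottleneck step described above, not Clio itself.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical task instruction and pre-extracted segment crops from the map.
task = "pick up a single sauce packet and place it on the plate"
segment_crops = [Image.open(p) for p in ["seg_000.jpg", "seg_001.jpg", "seg_002.jpg"]]

with torch.no_grad():
    text_inputs = processor(text=[task], return_tensors="pt", padding=True)
    task_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=segment_crops, return_tensors="pt")
    seg_embs = model.get_image_features(**image_inputs)

# Cosine similarity between each segment and the task description.
task_emb = task_emb / task_emb.norm(dim=-1, keepdim=True)
seg_embs = seg_embs / seg_embs.norm(dim=-1, keepdim=True)
relevance = (seg_embs @ task_emb.T).squeeze(-1)

# Keep only the segments most pertinent to the task; everything else is discarded.
threshold = 0.25  # illustrative value, would be tuned in practice
kept = [i for i, score in enumerate(relevance.tolist()) if score > threshold]
print("Task-relevant segments:", kept)
```

In Clio, the analogous step also decides how finely to segment (one packet versus the whole pile), whereas this sketch only prunes a fixed set of segments.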

To validate their approach, the researchers deployed Clio on a Boston Dynamics robot dog. Given a specific set of tasks to carry out, the robot explored an office building and mapped it. Clio was then able to pick out the segments of the scenes that were relevant to each task. Moreover, Clio ran locally on the robot's onboard computer, demonstrating that it is practical for real-world use.

So far, Clio has been used to complete relatively simple tasks. But looking ahead, the team hopes to allow for more complex tasks to be carried out by building upon recent advances in photorealistic visual scene representations.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.