By Your Command

From mobile manipulators to drones, robots can follow natural language instructions with the VLMaps framework.

Nick Bild
Robot path planning with natural language prompts (📷: Google Research)

As artificial intelligence has advanced, things that were once thought impossible have become possible. These advances have led to a growing interest in developing robots that can understand and follow natural language instructions. While this technology has the potential to revolutionize various industries, including manufacturing, healthcare, and transportation, there are still significant challenges to overcome in terms of robot navigation.

To build a working navigation system, the robot needs abilities that come naturally to humans but are very difficult to reproduce artificially. First, the robot must be able to understand natural language descriptions and how they relate to objects that it perceives visually. It must also be capable of spatial reasoning, connecting a representation of the environment on a map to the actual spatial distribution of objects in the real world.

For example, a person might say "go to the kitchen and get me the cup of coffee on the table," but what does "go to the kitchen" mean? Which path should the robot take to get there? And how does it know where the kitchen is in the first place? For that matter, how does it know what a table is, or how to find a cup (whatever that is) on top of it? This is very challenging to untangle computationally and turn into a set of steps for a robot to follow, yet it is a very simple set of instructions.

A collaboration between Google Research, the University of Freiburg, and the University of Technology Nuremberg has taken a step toward a world in which robots are at our beck and call. The team developed a framework called Visual Language Maps (VLMaps) that builds maps of an environment fused with visual-language embeddings. Once a robot can identify visual landmarks in an image and associate them with natural language descriptions, it becomes much easier to craft a set of steps that achieves the goal.

Recent work by other groups has greatly advanced the state of the art in large language models that can understand even very complex natural language instructions. Other research efforts have gone a step further, creating joint visual-language models that can associate natural language descriptions with the things they describe in images. But connecting those described objects to their actual counterparts in real-world 3D space has remained elusive, and that gap is where VLMaps has the most to contribute.
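To make the matching idea concrete, here is a minimal sketch (not the researchers' code) of how a CLIP-style joint visual-language model can score candidate image regions against a text query by comparing embeddings with cosine similarity. The embeddings and region labels below are toy values; a real system would obtain them from a pretrained encoder.

```python
# Minimal sketch (not the researchers' code): scoring image-region embeddings
# against a text query, CLIP-style. The toy embeddings stand in for the
# output of a pretrained joint visual-language encoder.
import numpy as np

def best_matching_region(region_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the region whose embedding has the highest
    cosine similarity with the text query embedding."""
    regions = region_embeddings / np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    return int(np.argmax(regions @ query))

# Toy 4-dimensional embeddings for three detected regions and one query.
regions = np.array([[0.9, 0.1, 0.0, 0.1],    # region that looks like a sofa
                    [0.1, 0.8, 0.2, 0.0],    # region that looks like a table
                    [0.0, 0.1, 0.9, 0.2]])   # region that looks like a sink
query = np.array([0.1, 0.85, 0.15, 0.0])     # embedding of the text "the table"
print(best_matching_region(regions, query))  # -> 1
```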

As a robot powered by VLMaps moves about, a pre-trained visual-language model fuses visual-language embeddings into a 3D reconstruction of the environment. Leveraging this model, the user’s instructions can be matched with objects in the mapped environment. Coordinates are obtained for the identified objects of interest and then fed to a Code as Policies module, which turns those spatial goals into a robot motion plan.
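As a rough illustration of that lookup step, the sketch below (hypothetical names, not the actual VLMaps API) finds the map cell whose stored visual-language embedding best matches a text query and returns its coordinates, which a downstream planner in the spirit of Code as Policies could then turn into motion commands.

```python
# Minimal sketch (hypothetical names, not the VLMaps API): find the grid cell
# in a visual-language map whose stored embedding best matches a text query.
import numpy as np

def locate_landmark(map_embeddings: np.ndarray, text_embedding: np.ndarray) -> tuple:
    """map_embeddings has shape (H, W, D), one embedding per map cell.
    Returns the (row, col) of the cell most similar to the text query."""
    h, w, d = map_embeddings.shape
    cells = map_embeddings.reshape(-1, d)
    cells = cells / np.linalg.norm(cells, axis=1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    best = int(np.argmax(cells @ query))
    return divmod(best, w)

# A downstream planner would then turn these coordinates into motion, e.g.:
#   row, col = locate_landmark(vl_map, embed_text("the kitchen table"))
#   waypoint = grid_to_world(row, col)          # hypothetical helper
#   plan = plan_path(robot_pose, waypoint)      # hypothetical planner call
```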

Of course, different types of robots have different capabilities, so depending on the particulars of the platform, the motion plan will need to differ. For example, a wheeled robot will be blocked by a table, but a drone can simply fly over it. VLMaps accounts for this by generating open-vocabulary obstacle maps that let users specify which types of objects a given robot can and cannot traverse. This additional information is used during path planning and has been shown to improve navigation efficiency compared with using a single navigation map for multiple robot types.
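The sketch below illustrates the idea with a toy grid (again hypothetical, not the team’s implementation): each cell stores its best-matching object category, and each robot supplies its own list of non-traversable categories, yielding a different obstacle mask for a wheeled robot than for a drone.

```python
# Minimal sketch (hypothetical, not the team's implementation): deriving
# robot-specific obstacle masks from an open-vocabulary map in which each
# cell stores the name of its best-matching object category.
import numpy as np

# Toy 3x4 top-down map of per-cell category labels.
cell_labels = np.array([
    ["floor", "floor", "table", "floor"],
    ["floor", "sofa",  "table", "floor"],
    ["floor", "floor", "floor", "cabinet"],
])

def obstacle_mask(labels: np.ndarray, blocked: set) -> np.ndarray:
    """True wherever a cell's category is non-traversable for this robot."""
    return np.isin(labels, list(blocked))

# A wheeled robot cannot drive over furniture; a drone can fly over the
# table and sofa but still has to avoid the tall cabinet.
ground_robot_mask = obstacle_mask(cell_labels, {"table", "sofa", "cabinet"})
drone_mask = obstacle_mask(cell_labels, {"cabinet"})
# Each mask is then handed to that robot's path planner.
```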

The VLMaps framework was put to the test in a simulated environment against the CoW and LM-Nav navigation systems. VLMaps was found to outperform these existing systems by as much as 17%, thanks in large part to its better understanding of the relationships between language and real-world objects.

The researchers are presently exploring how their system might function effectively in dynamic environments, such as one in which people and objects are in motion. If you are interested in digging further into the details of this project, everything has been open-sourced, so you are free to tinker around with it to your heart’s content.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.