A Textbook Approach to Navigation

MIT's robot navigation method uses LLMs and textual captions from images to reduce computational workloads and training data requirements.

Nick Bild
1 month ago · Robotics

The dense information provided by computer vision systems is a major factor in the successes of modern autonomous robots. Yet this rich source of information is also an Achilles’ heel in these same applications. High-resolution images provide computer systems with vast amounts of information about their surroundings, allowing them to locate objects of interest, calculate a safe path for navigation, and avoid obstacles. But these images contain many millions of individual pixels, each of which must be evaluated by an algorithm tens of times per second.

Processing requirements such as these not only increase the cost, size, and power consumption of a robot, but they also significantly limit what applications can be achieved practically. Furthermore, these algorithms also typically require massive amounts of training data, which can be very tough to come by. Unfortunately, that means the general-purpose service robots that we dream of having in our homes will remain nothing more than a dream until more efficient sensing mechanisms are developed. You will just have to fold your own laundry and cook your own meals for the time being.

A team from MIT and the MIT-IBM Watson AI Lab may not have solved this problem just yet, but they have moved the field forward with the development of a novel robot navigation scheme. Their approach minimizes the use of visual information and instead relies on the knowledge of the world contained in large language models (LLMs) to plan multi-step navigation tasks. Spoiler alert — this approach does not perform as well as state-of-the-art computer vision algorithms, but it significantly reduces both the computational workload and the volume of training data that is needed. And these factors make the new navigation method ideal for a number of use cases.

For starters, the new system captures an image of the robot’s surroundings. But rather than using the pixel-level data for navigation, it instead uses an off-the-shelf vision model to produce a textual caption of the scene. This caption is then fed into an LLM, along with a set of operator-provided instructions describing the task to be carried out. The LLM then predicts the next action the robot should take to achieve its goal. Once that action completes, the process repeats, iterating until the task is done.
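The caption-then-plan loop described above can be sketched in a few lines of Python. This is an illustrative outline only, not the team's actual implementation: the vision model and LLM calls are stubbed out with placeholder functions, and all function names here are assumptions for the sake of the example.

```python
def caption_scene(image):
    """Placeholder for an off-the-shelf image-captioning model."""
    return "You are in a hallway. A doorway is ahead on the left."

def llm_next_action(caption, instruction, history):
    """Placeholder for an LLM call that predicts the next action.

    A real system would send a prompt like the one below to an LLM;
    here we deterministically stop after two steps so the sketch is
    self-contained and runnable.
    """
    prompt = (
        f"Task: {instruction}\n"
        f"Actions so far: {history}\n"
        f"Current view: {caption}\n"
        "Next action (forward/left/right/stop):"
    )
    return "forward" if len(history) < 2 else "stop"

def navigate(instruction, get_image, max_steps=20):
    """Iterate: capture image -> caption -> LLM picks action -> repeat."""
    history = []
    for _ in range(max_steps):
        caption = caption_scene(get_image())          # pixels reduced to text
        action = llm_next_action(caption, instruction, history)
        if action == "stop":
            break
        history.append(action)                        # execute action, then loop
    return history

plan = navigate("Go through the doorway on the left.", get_image=lambda: None)
print(plan)  # ['forward', 'forward']
```

Note that the only thing the planner ever sees is text — the image itself is discarded after captioning, which is precisely where the computational savings come from.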

Testing showed that this method did not perform as well as a purely vision-based approach, which is not a surprise. However, it was demonstrated that when given only 10 real-world visual trajectories, the approach could quickly generate over 10,000 synthetic trajectories to use for training, thanks to the relatively lightweight algorithm. This could help to bridge the gap between simulated environments (where many algorithms are trained) and the real world to improve robot performance. Another nice benefit of this approach is that the model’s reasoning is easier for humans to understand, since it natively uses natural language.
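The article does not spell out how the synthetic trajectories are produced, but the key enabler is that a trajectory is now just a sequence of (caption, action) text pairs, which is cheap to store and recombine. The sketch below is a hypothetical illustration of that idea — splicing prefixes and suffixes of a few real trajectories to mass-produce variants — and should not be read as the researchers' actual augmentation procedure.

```python
import random

random.seed(0)

# Stand-ins for ten "real" trajectories, each a list of (caption, action) steps.
real = [
    [("hallway ahead", "forward"), ("door on left", "left"), ("inside room", "stop")]
    for _ in range(10)
]

def synthesize(real_trajectories, n):
    """Build n synthetic trajectories by splicing a prefix of one real
    trajectory onto the suffix of another at a random step index."""
    out = []
    while len(out) < n:
        a, b = random.sample(real_trajectories, 2)
        cut = random.randrange(1, min(len(a), len(b)))
        out.append(a[:cut] + b[cut:])
    return out

synthetic = synthesize(real, 10_000)
print(len(synthetic))  # 10000
```

Because each example is a short string sequence rather than a stream of images, generating ten thousand of them takes a fraction of a second on commodity hardware.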

As a next step, the researchers want to develop a navigation-oriented captioning algorithm — rather than using an off-the-shelf solution — to see if that might enhance the system’s performance. They also intend to explore the ability of LLMs to exhibit spatial awareness to better understand how that might be exploited to enhance navigation accuracy.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.