Now You’re Speaking My Language
Teaching LLMs to map their knowledge to real-world objects allows the wealth of information they contain to be applied to robot control.
Recent breakthroughs in artificial intelligence have shown that large language models are capable of performing complex tasks once thought to be the sole domain of humans. These models are trained on vast amounts of text data, which allows them to understand and generate human-like language. That ability has been leveraged for a wide range of tasks, from writing essays and poetry to translating languages and even writing code.
With popular chatbots like OpenAI’s ChatGPT making huge waves in the business world and even in popular culture, large language models appear poised to transform many aspects of our lives. However, as the novelty of these models wears off, their limitations will become more apparent. One such limitation is that, while these models can chat competently about many aspects of the real world, they have no way of connecting their internal representations to the actual, physical objects those representations describe.
This has implications for the application of language models in robotics. If you were to ask a chatbot for a procedure to sort a pile of objects into groups by color, for example, it might provide some reasonably good advice. But if that same request were posed to a language model running on a robot with computer vision capabilities, the model would have no way of linking the request to the robot’s visual observations, so it could not produce a control plan to actually carry out the task.
You might say that a language model is simply the wrong tool for the job. But there is a tremendous amount of information about the world encoded in these models, so finding a way to leverage them, along with some additional knowledge of how real-world objects, as perceived by sensors, relate to their textual descriptions, could be very powerful. A multidisciplinary team led by researchers at Robotics at Google believes that this is the case, and they have been working on an embodied multimodal language model called PaLM-E that incorporates continuous real-world sensor data into a language model to bridge this disconnect.
The team began with the pre-trained, 540 billion parameter PaLM model. Encoders were added to translate inputs such as images and state estimates into the same embedding space as the language tokens that PaLM normally accepts, yielding PaLM-E, whose largest version weighs in at 562 billion parameters. The pipeline was then trained to output a sequence of decisions as natural text. This sequence can be interpreted by a robot to create and execute a movement policy in accordance with the model’s decisions.
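To make the idea concrete, the sketch below shows, in very rough form, how continuous observations can be projected into a language model’s token embedding space and processed alongside ordinary text tokens. It is a minimal illustration rather than the PaLM-E implementation: the tiny Transformer backbone, the dimensions, and the layer names are all assumptions made for the example.

# Minimal sketch (not the PaLM-E implementation): continuous observations are
# encoded, linearly projected into the language model's token embedding space,
# and interleaved with ordinary text token embeddings before decoding.
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, image_feat_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> embeddings
        self.image_proj = nn.Linear(image_feat_dim, d_model)   # vision features -> "soft tokens"
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # tiny stand-in for the LLM
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        # text_ids: (batch, text_len) integer token ids
        # image_feats: (batch, num_patches, image_feat_dim) from a vision encoder
        text_emb = self.token_embed(text_ids)
        image_emb = self.image_proj(image_feats)        # same width as the text embeddings
        seq = torch.cat([image_emb, text_emb], dim=1)   # one multimodal "sentence"
        hidden = self.backbone(seq)
        return self.lm_head(hidden)                     # next-token logits over the vocabulary

# Toy usage: one image (16 patch features) plus an 8-token instruction.
model = MultimodalPrefixLM()
logits = model(torch.randint(0, 32000, (1, 8)), torch.randn(1, 16, 768))
print(logits.shape)  # torch.Size([1, 24, 32000])

The key design choice this illustrates is that the sensor data is not handled by a separate pipeline: once projected, the visual “tokens” flow through the same model, and the same training objective, as the words of the instruction.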
A series of experiments was conducted with two robots under the control of PaLM-E: one a tabletop manipulator arm, and another a fully mobile robot. The robots were given tasks like “bring me the rice chips from the drawer” or “move the yellow hexagon to the green star” via text prompts. Over the course of these trials, it was found that a single model was able to achieve high performance not only on a variety of tasks, but also across different types of robots.
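As a loose illustration of how a robot might act on the model’s natural-text output, the sketch below parses a numbered plan into calls to pre-existing low-level skills. The skill names and the parser are hypothetical, invented for this example; in the real system the textual decisions are handed to learned low-level control policies rather than to this kind of hand-written dispatch.

# Hypothetical glue between a language model's textual plan and low-level
# robot skills. The skill set and plan format below are assumptions.
from typing import Callable, Dict

def go_to(target: str) -> None:
    print(f"navigating to {target}")

def pick(obj: str) -> None:
    print(f"picking up {obj}")

def hand_over(obj: str) -> None:
    print(f"handing over {obj}")

# Registry of primitive skills the robot already knows how to execute.
SKILLS: Dict[str, Callable[[str], None]] = {
    "go to": go_to,
    "pick": pick,
    "hand over": hand_over,
}

def execute_plan(plan: str) -> None:
    """Parse numbered natural-text steps (e.g. '1. go to the drawer')
    and dispatch each one to the matching low-level skill."""
    for line in plan.strip().splitlines():
        step = line.split(".", 1)[1].strip().lower()   # drop the '1.' prefix
        for name, skill in SKILLS.items():
            if step.startswith(name):
                skill(step[len(name):].strip())
                break

# Example of the kind of plan a model might produce for
# "bring me the rice chips from the drawer".
execute_plan("""
1. go to the drawer
2. pick the rice chips
3. hand over the rice chips
""")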
The researchers have released many demonstrations of their robots in action on GitHub. They are very impressive, so be sure to check them out.