A Trip to ReMEmbR
ReMEmbR enables robots to build and query a long-term memory of their visual experiences to aid in navigation and complex reasoning.
As robots are deployed to an increasingly diverse range of locations — such as warehouses, homes, and office buildings — more will be expected of them. The decades-long assumption that a robot needs to do nothing more than carry out the same task over and over no longer holds. Advances in machine learning, especially generative AI, have played a large role in our shifting expectations for our mechanical friends. We now want to interact with robots in a more natural way, much like we can with large language models (LLMs). But are today’s technologies up to the task?
Not entirely. To truly understand their environment, robots must be able to perceive all sorts of events and objects, and they must also encode that information and remember it for long periods of time. Existing methods of representation often fall flat, however, and there is no effective way to retrieve data about what a robot has encountered over hours or days. Researchers at NVIDIA, the University of Southern California, and the University of Texas at Austin are working to change that with a system they call Retrieval-augmented Memory for Embodied Robots (ReMEmbR). ReMEmbR was designed for long-horizon video question answering to aid in robot navigation.
ReMEmbR operates in two phases: memory building and querying. In the memory-building phase, the system captures short video segments from the robot's environment and uses a vision-language model, such as NVIDIA's VILA, to generate descriptive captions for those segments. The captions, along with their timestamps and the robot's spatial coordinates, are then embedded and stored in a MilvusDB vector database. This embedding step converts each caption into a vector, allowing for efficient storage and retrieval. By organizing memory in this structured way, ReMEmbR enables robots to maintain a scalable, long-horizon semantic memory that can be easily queried later on.
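The memory-building loop can be pictured as a small pipeline: caption a segment, embed the caption, and store it with its metadata. The sketch below is only an illustration of that idea, not the official implementation; it assumes a local Milvus Lite database via pymilvus, a SentenceTransformer embedder standing in for whatever embedding model ReMEmbR actually uses, and a placeholder caption_segment() function wrapping a vision-language model such as VILA.

```python
# Minimal sketch of ReMEmbR-style memory building (illustrative, not the official code).
# Assumptions: pymilvus (Milvus Lite), sentence-transformers, placeholder VLM captioner.
import time
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

EMBED_DIM = 384  # all-MiniLM-L6-v2 output size; ReMEmbR may use a different embedder

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = MilvusClient("robot_memory.db")  # local Milvus Lite database file
if not client.has_collection("memory"):
    client.create_collection(collection_name="memory", dimension=EMBED_DIM)


def caption_segment(frames):
    """Placeholder for a VLM such as VILA that captions a short video segment."""
    return "A person walks past a row of elevators near the lobby."


def add_memory(memory_id, frames, x, y):
    """Caption a segment and store the caption, its timestamp, and the robot's position."""
    caption = caption_segment(frames)
    vector = embedder.encode(caption).tolist()
    client.insert(
        collection_name="memory",
        data=[{
            "id": memory_id,        # primary key
            "vector": vector,       # text embedding of the caption
            "caption": caption,
            "timestamp": time.time(),
            "x": x,                 # robot position when the segment was captured
            "y": y,
        }],
    )


# Example: store one captioned segment recorded at position (3.2, -1.5)
add_memory(0, frames=None, x=3.2, y=-1.5)
```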
During the querying phase, an LLM-based agent interacts with the memory to answer user questions. When a question is posed (e.g., "Where is the nearest elevator?"), the LLM generates a series of queries to the vector database, retrieving relevant information based on text descriptions, timestamps, or spatial coordinates. The LLM iteratively refines its queries until it gathers enough context to provide a comprehensive answer. This process allows the robot to perform complex reasoning tasks that take into account both the spatial and temporal aspects of its experiences. The LLM itself can be served by NVIDIA NIM microservices, run on-device, or accessed through other LLM APIs, giving developers flexibility in how the system is deployed.
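At query time the agent essentially runs a retrieve-and-decide loop: search the memory, ask the LLM whether it has enough context, and either refine the search or answer. The sketch below shows one way such a loop might look, reusing the collection and embedder from the previous sketch and a placeholder call_llm() function standing in for an NVIDIA NIM microservice, an on-device model, or any other LLM API; it is a simplified stand-in for ReMEmbR's actual agent.

```python
# Minimal sketch of a ReMEmbR-style query loop (illustrative, not the official agent).
# Reuses `client` and `embedder` from the memory-building sketch above.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an NVIDIA NIM endpoint or a local model)."""
    raise NotImplementedError


def retrieve(query_text: str, top_k: int = 5):
    """Search the memory collection for captions similar to the query text."""
    query_vec = embedder.encode(query_text).tolist()
    hits = client.search(
        collection_name="memory",
        data=[query_vec],
        limit=top_k,
        output_fields=["caption", "timestamp", "x", "y"],
    )
    return [hit["entity"] for hit in hits[0]]


def answer(question: str, max_rounds: int = 3) -> str:
    """Iteratively query the memory until the LLM decides it can answer."""
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        prompt = (
            "You are a robot answering questions about places you have seen.\n"
            f"Question: {question}\nRetrieved memories: {context}\n"
            "Reply with ANSWER: <answer> if you have enough information, "
            "or QUERY: <new search text> to retrieve more."
        )
        reply = call_llm(prompt)
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        query = reply.removeprefix("QUERY:").strip()
    return "I could not find that in my memory."
```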
To demonstrate ReMEmbR in action, the team deployed their system on a physical Nova Carter robot. The integration involved several key steps. First, they built an occupancy grid map of the robot's environment using 3D lidar and odometry data, which provided the global pose information needed for navigation. Next, they populated the vector database by teleoperating the robot while the VILA model generated captions for the robot's camera images; these captions, along with pose and timestamp data, were embedded and stored in the database. Once the database was ready, the ReMEmbR agent was activated to handle user queries. The agent processed each query by retrieving relevant information from the database and determining the appropriate action, such as guiding the robot to a specific location. To make interaction more natural, the team also incorporated speech recognition, allowing users to issue voice commands to the robot.
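Once the agent decides the robot should go somewhere, the retrieved coordinates still have to become a navigation goal. On a ROS 2 robot like Nova Carter this typically means sending a NavigateToPose action goal to the Nav2 stack; the sketch below shows the general shape of that step under the assumption of a standard Nav2 setup with a map frame, and the team's actual integration may well differ in its details.

```python
# Minimal sketch of turning a retrieved (x, y) position into a Nav2 navigation goal.
# Assumes a standard ROS 2 / Nav2 stack; the team's actual integration may differ.
import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped
from nav2_msgs.action import NavigateToPose


class ReMEmbRGoalSender(Node):
    """Sends a NavigateToPose goal at the coordinates retrieved from memory."""

    def __init__(self):
        super().__init__("remembr_goal_sender")
        self._client = ActionClient(self, NavigateToPose, "navigate_to_pose")

    def go_to(self, x: float, y: float):
        goal = NavigateToPose.Goal()
        pose = PoseStamped()
        pose.header.frame_id = "map"                       # global frame from the occupancy grid map
        pose.header.stamp = self.get_clock().now().to_msg()
        pose.pose.position.x = float(x)
        pose.pose.position.y = float(y)
        pose.pose.orientation.w = 1.0                      # arbitrary default heading
        goal.pose = pose
        self._client.wait_for_server()
        return self._client.send_goal_async(goal)


def main():
    rclpy.init()
    node = ReMEmbRGoalSender()
    # Example: drive to the elevator location retrieved from the memory database
    future = node.go_to(3.2, -1.5)
    rclpy.spin_until_future_complete(node, future)
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```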
You really have to see the robot in action to understand how natural interactions with ReMEmbR can be. Be sure to check out the video below for a glimpse of what is possible.