Building robots that can operate in unconstrained 3D settings is of great interest due to the myriad applications and opportunities it could unlock. Unlike the controlled environments where robots are typically deployed, such as factories or laboratories, the real world is filled with complex and unstructured spaces. By enabling robots to navigate and perform tasks in these realistic settings, we empower them to interact with the world in a manner similar to humans, opening up a wide range of new and interesting possibilities.
However, enabling robots to operate in real-world 3D settings is highly challenging. These environments present a multitude of uncertainties, including unpredictable terrain, changing lighting conditions, dynamic obstacles, and cluttered scenes. Robots must possess advanced perception capabilities to understand and interpret their surroundings accurately. And critically, they need to navigate efficiently and adaptively plan their actions based on real-time sensory information.
Most commonly, robots designed to interact with an unstructured environment leverage one or more cameras to collect information about their surroundings. These images are then directly processed to provide the raw inputs to algorithms that determine the best plan of action for the robot to achieve its goals. Such methods have proven very successful for relatively simple pick-and-place and object rearrangement tasks, but they begin to break down where reasoning in three dimensions is needed.
To improve upon this situation, a number of methods have been proposed that first create a 3D representation of the robot's surroundings, then use that information to inform the robot's actions. Such techniques have certainly proven to perform better than direct image processing-based methods, but they come at a cost. In particular, the computational cost is much higher, which means the hardware needed to power the robots is more expensive and energy-hungry. This factor also hinders rapid development and prototyping activities, in addition to limiting system scalability.
This long-standing trade-off between accuracy and computational cost may soon vanish, thanks to the recent work of a team at NVIDIA. They have developed a method called Robotic View Transformer (RVT), a transformer-based machine learning model that is ideally suited for 3D manipulation tasks. And when compared with existing solutions, RVT systems can be trained faster, have a higher inference speed, and achieve higher rates of success on a wide range of tasks.
RVT is a view-based approach that leverages inputs from multiple cameras (or in some cases, a single camera). Using this data, it attends over multiple views of the scene to aggregate information across them. This information is used to produce view-wise heatmaps, which in turn are used to predict the position the robot's end effector should move to next in order to accomplish its goal.
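To make the idea concrete, the snippet below is a minimal sketch of how per-view heatmaps can be combined into a single 3D prediction: each candidate 3D point is projected into every view, the heatmap values at those projections are summed, and the highest-scoring point wins. The function name, the candidate-grid approach, and the projection callbacks are illustrative assumptions, not NVIDIA's actual implementation.

```python
import numpy as np

def predict_position(heatmaps, project_fns, candidates):
    """Score candidate 3D points by summing the heatmap values at their
    2D projections in every view, then return the best-scoring point.

    heatmaps    -- list of (H, W) arrays, one per view
    project_fns -- list of functions mapping a 3D point to (row, col)
                   pixel coordinates in the corresponding view
                   (hypothetical stand-ins for known camera projections)
    candidates  -- (N, 3) array of candidate 3D positions to score
    """
    scores = np.zeros(len(candidates))
    for hm, project in zip(heatmaps, project_fns):
        for i, point in enumerate(candidates):
            r, c = project(point)
            # Only count projections that land inside the image
            if 0 <= r < hm.shape[0] and 0 <= c < hm.shape[1]:
                scores[i] += hm[r, c]
    return candidates[np.argmax(scores)]
```

A point that looks promising from only one angle scores lower than one that is supported by every view, which is the benefit of aggregating information across views rather than trusting any single image.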
One of the key insights that made RVT possible is the use of what the team calls virtual views. Rather than feeding the raw camera images directly into the processing pipeline, the images are first re-rendered into virtual views, which provide a number of benefits. For example, the cameras may not be able to capture the best angle for every task, but a virtual view can be constructed, using the actual images, that provides a better, more informative angle. Naturally, the better the raw data that is fed into the system, the better the results can be.
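The essence of a virtual view is re-rendering the scene's 3D points (recovered from the real cameras) through a camera that does not physically exist. The sketch below renders a colored point cloud into a simple top-down orthographic view with a z-buffer. The resolution, bounds, and single-channel output are illustrative assumptions; the actual RVT renderer also produces depth and other auxiliary channels.

```python
import numpy as np

def render_virtual_view(points, colors, axis=2, res=64, bounds=(-1.0, 1.0)):
    """Render a colored point cloud into a virtual orthographic view.

    Drops the chosen axis (2 = a top-down view), bins the remaining two
    coordinates into a res x res image, and keeps only the point nearest
    the virtual camera at each pixel (a simple z-buffer).
    """
    lo, hi = bounds
    img = np.zeros((res, res, 3))
    depth = np.full((res, res), np.inf)
    keep = [i for i in range(3) if i != axis]
    for p, c in zip(points, colors):
        u = int((p[keep[0]] - lo) / (hi - lo) * (res - 1))
        v = int((p[keep[1]] - lo) / (hi - lo) * (res - 1))
        # The virtual camera looks down the dropped axis, so a larger
        # coordinate along it means the point is closer to the camera.
        if 0 <= u < res and 0 <= v < res and -p[axis] < depth[u, v]:
            depth[u, v] = -p[axis]
            img[u, v] = c
    return img
```

Because the virtual camera's placement is a free choice, views can be oriented to whatever angle is most informative for the task at hand, regardless of where the physical cameras sit.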
RVT was benchmarked in simulated environments using RLBench and compared with the state-of-the-art PerAct system for robotic manipulation. Across 18 tasks, with 249 variations, RVT was found to perform very well, outperforming PerAct with an average success rate that was 26% higher. Model training was also observed to be 36 times faster using the new techniques, which is a huge boon to research and development efforts. These improvements also came with a speed boost at inference time: RVT was demonstrated to run 2.3 times faster.
Some real-world tasks were also tested with a physical robot, ranging from stacking blocks to putting objects in a drawer. High rates of success were generally seen across these tasks, and importantly, the robot only needed to be shown a few demonstrations of a task to learn to perform it.
At present, RVT requires the calibration of extrinsics from the camera to the robot base before it can be used. The researchers are exploring ways to remove this constraint in the future.
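That calibration boils down to a rigid transform. Once the extrinsics (a rotation and a translation) are known, points seen in the camera frame can be expressed in the robot's base frame, which is what the planner needs. The sketch below illustrates the transform itself; `R` and `t` are hypothetical placeholders for values a hand-eye calibration routine would produce.

```python
import numpy as np

def camera_to_base(points_cam, R, t):
    """Transform camera-frame 3D points into the robot base frame.

    points_cam -- (N, 3) array of points in the camera frame
    R          -- (3, 3) rotation from camera frame to base frame
    t          -- (3,) translation of the camera in the base frame
    (R and t come from an extrinsic calibration step, assumed done.)
    """
    # p_base = R @ p_cam + t, vectorized over all points
    return points_cam @ R.T + t
```

Removing this calibration requirement, as the researchers hope to do, would mean the system could work without knowing `R` and `t` in advance.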