Clearing Up Double Vision in Robots

Caltech's VPEngine is a framework that makes robots faster and more efficient by cutting out redundant visual processing tasks.

Nick Bild
3 months ago • Robotics

In the world of robotics, visual perception is not a single task. Rather, it is composed of a number of subtasks, including feature extraction, image segmentation, depth estimation, and object detection. Each of these subtasks typically executes in isolation, after which the individual results are merged to contribute to a robot’s overall understanding of its environment.

This arrangement gets the job done, but it is not especially efficient. Many of the underlying machine learning models need to do some of the same steps — like feature extraction — before they move on to their task-specific components. That not only wastes time, but for robots running on battery power, it also limits the time they can operate between charges.

A group of researchers at the California Institute of Technology came up with a clever solution to this problem that they call the Visual Perception Engine (VPEngine). It is a modular framework that was created to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. VPEngine leverages a shared backbone and parallelization to eliminate unnecessary GPU-CPU memory transfers and other computational redundancies.

At the core of VPEngine is a foundation model — in their implementation, DINOv2 — that extracts rich visual features from images. Instead of running multiple perception models in sequence, each repeating the same feature extraction process, VPEngine computes those features once and shares them across multiple task-specific “head” models. These head models are lightweight and specialize in functions such as depth estimation, semantic segmentation, or object detection.
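The shared-backbone idea is easy to express in code. The sketch below is not the actual VPEngine API; it is a minimal PyTorch illustration in which a stand-in encoder plays the role of DINOv2 and the head modules are invented for the example, showing how one expensive feature pass can feed several cheap task heads.

```python
# Minimal sketch of the shared-backbone pattern (illustrative only, not the
# VPEngine API): the backbone runs once per frame, and several lightweight
# heads reuse its output instead of re-extracting features themselves.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for a foundation model such as DINOv2."""
    def __init__(self, dim=384):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=14, stride=14),  # ViT-style patchify
            nn.GELU(),
        )

    def forward(self, image):
        return self.encoder(image)  # shared feature map: (B, dim, H/14, W/14)

class DepthHead(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(dim, 1, kernel_size=1)  # per-patch depth value

    def forward(self, feats):
        return self.proj(feats)

class SegmentationHead(nn.Module):
    def __init__(self, dim=384, num_classes=21):
        super().__init__()
        self.proj = nn.Conv2d(dim, num_classes, kernel_size=1)  # per-patch logits

    def forward(self, feats):
        return self.proj(feats)

backbone, depth_head, seg_head = Backbone(), DepthHead(), SegmentationHead()
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    feats = backbone(image)         # expensive step, computed exactly once
    depth = depth_head(feats)       # cheap head reusing the shared features
    segmentation = seg_head(feats)  # second cheap head reusing the same features
```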

The team designed the framework with several key requirements in mind: fast inference for quick responses, predictable memory usage for reliable long-term deployment, flexibility for different robotic applications, and dynamic task prioritization. The last of these is particularly important, as robots often need to shift their focus depending on context — for instance, prioritizing obstacle avoidance in cluttered environments or focusing on semantic understanding when interacting with humans.
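To make the prioritization idea concrete, here is a hypothetical sketch (not VPEngine's actual scheduler) in which each head is assigned a rate divider and lower-priority heads simply run on fewer frames. The head names and the priority scheme are invented for illustration.

```python
# Hypothetical dynamic task prioritization: high-priority heads run every
# frame, lower-priority heads are skipped on most frames. Illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskHead:
    name: str
    run: Callable
    every_n_frames: int = 1   # 1 = every frame; larger = lower priority

def process_frame(frame_idx, feats, heads):
    results = {}
    for head in heads:
        if frame_idx % head.every_n_frames == 0:   # skip demoted heads
            results[head.name] = head.run(feats)
    return results

# Example: in a cluttered environment, depth (for obstacle avoidance) runs on
# every frame, while semantic segmentation is demoted to every fourth frame.
heads = [
    TaskHead("depth", run=lambda f: "depth map", every_n_frames=1),
    TaskHead("segmentation", run=lambda f: "segmentation map", every_n_frames=4),
]

for i in range(8):
    out = process_frame(i, feats=None, heads=heads)
    print(i, sorted(out))
```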

VPEngine achieves much of its efficiency by making heavy use of NVIDIA’s CUDA Multi-Process Service. This allows the separate task heads to run in parallel, ensuring high GPU utilization while avoiding bottlenecks. The researchers also built custom inter-process communication tools so that GPU memory could be shared directly between processes without costly transfers. Each module runs independently, meaning that a failure in one perception task will not bring down the entire system, which is an important consideration for safety and reliability.
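As a rough illustration of keeping data on the GPU while handing it between processes, the sketch below uses torch.multiprocessing, which on desktop-class GPUs passes CUDA tensors by handle rather than by copy. VPEngine relies on its own custom IPC tools together with CUDA MPS (whose control daemon is typically started with nvidia-cuda-mps-control), so treat this only as an analogy for the general technique, not as the framework's implementation.

```python
# Sketch: pass a GPU-resident tensor to another process without a GPU-CPU
# round trip, via torch.multiprocessing. Illustrative only; VPEngine uses its
# own inter-process communication tools, and platform support details differ.
import torch
import torch.multiprocessing as mp

def head_worker(queue):
    feats = queue.get()              # receives a shared handle, not a copy
    assert feats.is_cuda             # data is still resident in GPU memory
    depth = feats.mean(dim=1)        # stand-in for a task head's computation
    print("head output shape:", tuple(depth.shape))

if __name__ == "__main__":
    mp.set_start_method("spawn")     # required when sharing CUDA tensors
    queue = mp.Queue()
    worker = mp.Process(target=head_worker, args=(queue,))
    worker.start()

    shared_feats = torch.randn(1, 384, 16, 16, device="cuda")
    queue.put(shared_feats)          # shares the allocation with the worker
    worker.join()
```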

On the NVIDIA Jetson Orin AGX platform, the team achieved real-time performance at 50 Hz or greater with TensorRT-optimized models. Compared to traditional sequential execution, VPEngine delivered up to a threefold speedup while maintaining a constant memory footprint.

Beyond performance, the framework is also designed to be developer-friendly. Written in Python with C++ bindings for ROS2, it is open source and highly modular, enabling rapid prototyping and customization for a wide variety of robotic platforms.
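For a sense of what integration on the robot side might look like, here is a minimal rclpy node that subscribes to a camera topic and republishes a result. The topic names, node structure, and placeholder processing are assumptions made for this example, not VPEngine's published ROS2 interface.

```python
# Minimal ROS2 (rclpy) node sketch: subscribe to camera frames, run perception,
# publish a result. The topics and processing here are placeholders.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image

class PerceptionNode(Node):
    def __init__(self):
        super().__init__("perception_demo")
        self.sub = self.create_subscription(
            Image, "/camera/image_raw", self.on_image, 10)
        self.pub = self.create_publisher(Image, "/perception/depth", 10)

    def on_image(self, msg):
        # In a real system, the shared backbone and task heads would run here.
        self.pub.publish(msg)  # placeholder: echo the input frame

def main():
    rclpy.init()
    rclpy.spin(PerceptionNode())
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```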

By cutting out redundant computation and enabling smarter multitasking, the VPEngine framework could help robots become faster, more power-efficient, and ultimately more capable in dynamic environments.
