Robots Now Think for Themselves

Google's Gemini Robotics On-Device VLA model gives robots a deep understanding of the world around them, and it runs entirely locally.

Nick Bild
5 months ago · Robotics
Gemini Robotics On-Device helps robots understand the world (📷: Google DeepMind)

Just a few months ago, Google DeepMind released a pair of new vision-language-action (VLA) models called Gemini Robotics that, as the name implies, are designed to give robots multimodal reasoning capabilities. VLA models break large language models free from their confinement to the digital realm by giving them a deep understanding of the physical world through information found in text, images, audio, and video. Robots can leverage that real-world understanding to do everything from making deliveries to making pancakes.

The initial Gemini Robotics release relied on some pretty hefty models that could only run on powerful computing systems. For robots with limited onboard resources, that means offloading processing to remote data centers in the cloud. But what if the robot has no internet access, or only intermittent access? And what about situations where real-time operation cannot tolerate the network latency this architecture introduces?

Until now, you would have been out of luck if you wanted to use Gemini Robotics for these applications. That changes with the DeepMind team's latest release, Gemini Robotics On-Device. Like the previous models, On-Device is a powerful VLA that helps robots understand the world around them. But in this case, the model has been heavily optimized so that it can run directly on the robot's onboard hardware, with no network connection needed.
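To make the architecture concrete, here is a rough sketch of what an on-device control loop looks like. Every name in it (LocalVLA, read_camera, send_joint_commands) is a hypothetical placeholder rather than any actual Gemini Robotics interface; the point is simply that the perception-to-action inference step runs on the robot's own compute, so no iteration of the loop waits on a cloud round trip.

```python
# Rough sketch of an on-device control loop. All names here (LocalVLA,
# read_camera, send_joint_commands) are hypothetical placeholders, not
# the Gemini Robotics API. The key idea: inference runs on the robot's
# own hardware, so no iteration blocks on a network round trip.
import time

class LocalVLA:
    """Stand-in for a VLA model loaded from the robot's local storage."""
    def infer(self, image, instruction):
        # A real model would map (camera frame, instruction) to actions.
        return [0.0] * 7  # placeholder commands for a 7-joint arm

def read_camera():
    return None  # placeholder for grabbing an onboard camera frame

def send_joint_commands(actions):
    pass  # placeholder for the robot's motor interface

model = LocalVLA()  # loaded once at startup; no cloud endpoint involved
for _ in range(200):  # run the loop for roughly ten seconds
    frame = read_camera()
    actions = model.infer(frame, "unzip the bag")
    send_joint_commands(actions)
    time.sleep(0.05)  # ~20 Hz; latency bounded by local inference alone
```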

Despite its smaller footprint, Gemini Robotics On-Device has been demonstrated to deliver impressive performance. It exhibits strong generalization across a range of complex real-world tasks and responds to natural language instructions with precision. Tasks like unzipping bags, folding clothes, and assembling industrial components can now be performed with a high degree of dexterity — all without relying on remote servers.

DeepMind is also launching a Gemini Robotics SDK, which lets developers evaluate the model in simulated environments using the MuJoCo physics engine and quickly fine-tune it for their own specific use cases. The model has been shown to adapt to new tasks with just 50 to 100 demonstration examples.
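The SDK's own interface isn't detailed here, but the general shape of a MuJoCo-based evaluation loop is standard. The sketch below uses MuJoCo's real Python bindings; the scene file name and the query_policy stub, standing in for a call into a fine-tuned model, are assumptions for illustration only.

```python
# Sketch of policy evaluation in MuJoCo using its standard Python
# bindings. The scene file and query_policy() are illustrative
# stand-ins; only the mujoco calls themselves are real API.
import numpy as np
import mujoco

# Load a scene description (MJCF); the file name is hypothetical.
model = mujoco.MjModel.from_xml_path("bimanual_manipulation_scene.xml")
data = mujoco.MjData(model)

def query_policy(instruction, observation):
    """Hypothetical stand-in for a fine-tuned VLA model: maps a natural
    language instruction plus an observation to actuator commands."""
    return np.zeros(model.nu)

instruction = "fold the cloth in half"
for _ in range(1000):
    obs = data.qpos.copy()       # joint positions as a toy observation
    data.ctrl[:] = query_policy(instruction, obs)
    mujoco.mj_step(model, data)  # advance the physics one timestep
```

A real evaluation would feed rendered camera frames to the model rather than raw joint positions, but the load-observe-act-step structure of the loop stays the same.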

Aside from adapting to new tasks, the On-Device model can also adapt to different robot types. Though originally trained on ALOHA robots, the model has been successfully fine-tuned to control other robotic systems like the dual-arm Franka FR3 and the Apollo humanoid by Apptronik. In each case, it maintained its ability to generalize across different tasks.

With Gemini Robotics On-Device, DeepMind is bringing cutting-edge AI capabilities directly to the machines that need them, untethering robots from the cloud and pushing the limits of what they can do autonomously.
