Two weeks ago I set out to build something different from the usual AI agent demo: a system where an operator talks to a robot through voice, and the robot keeps moving while the AI thinks. Making that work meant solving real systems problems: keeping inference out of the control loop, handing off between a VLM and a real-time tracker, failing over between cloud and local backends without split-brain, and preserving context across switches. This post is about how Gemini Live API and Google Cloud helped me solve them.
HALO is open source — you can find the code at https://github.com/andrei-ace/HALO
The Stop-and-Go Problem
The first prototype was humbling. I had a planner calling tools, a simulated arm, and the whole thing froze every time the LLM reasoned about what to do next. Three seconds of inference meant three seconds of a motionless robot. That's not how physical systems work.
The fix was architectural: split the system into two independent paths. A decision path where the LLM reasons at its own pace, and a control path where tracking, skill execution, control, and safety run continuously at their own rates — from roughly 10 Hz up to 100 Hz. The planner reads a compact snapshot of the current state and issues high-level commands. Deterministic services handle control, safety, and skill execution in real time, while async perception stays off the critical path. The two paths share state but never block each other.
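The decision/control split above can be sketched as a small shared-state object. This is a minimal, hypothetical sketch (the names `StateSnapshot` and `SharedState` are mine, not from the HALO codebase): the control path publishes snapshots at its own rate, the planner reads and issues commands at its own pace, and neither side ever waits on the other beyond a brief lock.

```python
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class StateSnapshot:
    """Compact state the planner sees -- no raw telemetry."""
    phase: str = "IDLE"
    distance_m: float = 0.0
    confidence: float = 0.0

class SharedState:
    """Shared between the decision path and the control path; neither blocks the other."""
    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = StateSnapshot()
        self._command = None

    def publish(self, snapshot: StateSnapshot) -> None:
        # Control path, called at 10-100 Hz.
        with self._lock:
            self._snapshot = snapshot

    def read(self) -> StateSnapshot:
        # Decision path, called whenever the planner wakes up.
        with self._lock:
            return self._snapshot

    def issue(self, command: str) -> None:
        # Planner issues a high-level command for the control path to pick up.
        with self._lock:
            self._command = command

    def take_command(self):
        # Control path consumes at most one pending command per tick.
        with self._lock:
            cmd, self._command = self._command, None
            return cmd
```

The lock is held only for a pointer swap, so the 100 Hz loop never stalls behind a multi-second inference call.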
That separation shaped every design decision that followed.
The Perception Problem: VLM Inference vs. Real-Time Tracking
Once the architecture was in place, I hit the next wall: perception. A vision-language model can understand a scene — "there's a red cube near the center of the table" — but it takes hundreds of milliseconds. A robot approaching an object needs tracking updates every 30-50 ms.
My first attempt had the fast tracker waiting for VLM results. Terrible. The second attempt ran them in parallel but had a gap when the VLM finished — the new tracker started from scratch and lost a few hundred milliseconds of motion context.
The solution that worked best was frame-buffer replay. While the VLM runs asynchronously, a ring buffer captures every incoming camera frame. The existing tracker keeps publishing hints uninterrupted. When the VLM completes, a new tracker is initialized from the VLM result and the buffered frames are replayed through it, fast-forwarding it to the present. Only then does the service switch over. No gap, no stale hints.
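The replay mechanism is simple to sketch. This is an illustrative toy (the `ReplayTracker` and `FrameBuffer` names and the frame representation are assumptions): frames pushed while the VLM runs are replayed into the freshly initialized tracker before it goes live.

```python
from collections import deque

class ReplayTracker:
    """Toy stand-in for a real tracker: remembers the last frame it processed."""
    def __init__(self, init_frame):
        self.last = init_frame

    def update(self, frame):
        self.last = frame

class FrameBuffer:
    """Ring buffer of recent camera frames, captured while the VLM runs."""
    def __init__(self, maxlen=64):
        self.frames = deque(maxlen=maxlen)

    def push(self, frame):
        self.frames.append(frame)

    def replay_into(self, tracker):
        # Fast-forward the new tracker through everything it missed.
        for f in self.frames:
            tracker.update(f)
        return tracker
```

Only after `replay_into` returns does the perception service swap the new tracker in, so the published hints never jump backward in time.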
I later split VLM inference into two concurrent paths — one for scene description, one for tracking — so an operator asking "what do you see?" doesn't block target acquisition. That came from a real debugging session where scene description and target tracking were fighting over the same VLM lock.
Giving the Robot a Voice with Gemini Live API
The next big milestone was making HALO conversational. The Gemini Live API via Google ADK turned out to be a natural fit for the Live Agent — the conversational voice layer between the operator and the system. Native audio in and out as a single streaming session, with tool-calling built in. No separate ASR-to-LLM-to-TTS pipeline to stitch together.
Separately, the task-level PlannerService is implemented as an ADK ReAct agent with three core tools: start_skill, abort_skill, and describe_scene. The Live Agent and the planner are distinct: the Live Agent handles conversation and narration, while the planner handles skill orchestration and recovery. The Live Agent can forward operator intents to the planner, but they run independently.
There was a fundamental problem with the Live Agent, though. Gemini runs in the cloud. The robot state lives locally. By the time a tool result round-trips through the cloud, the arm may have moved, the skill may have advanced, the scene may have changed.
I solved this with what I call proxy tools. On the cloud side, the tool definitions exist but the implementations are stubs. When Gemini calls a tool, a callback intercepts it, serializes the call over WebSocket to the local TUI, and the TUI executes it locally against the freshest runtime state. The result flows back and gets injected into Gemini's context. The cloud agent thinks it called a tool; the execution happened locally with current data.
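The shape of the proxy-tool pattern can be shown with a toy bridge. This is a hypothetical sketch, not HALO's implementation: a plain queue stands in for the WebSocket link, and `local_handlers` stands in for the TUI-side tool implementations.

```python
import json
import queue

class ProxyToolBridge:
    """Cloud-side tool stubs serialize their calls; execution happens locally."""
    def __init__(self, local_handlers):
        self.local_handlers = local_handlers   # name -> callable, runs against fresh local state
        self.link = queue.Queue()              # stand-in for the WebSocket to the TUI

    def cloud_tool_call(self, name, **kwargs):
        # The callback intercepts the model's tool call and ships it over the "wire".
        self.link.put(json.dumps({"tool": name, "args": kwargs}))
        # The result flows back and is injected into the model's context.
        return self.local_execute()

    def local_execute(self):
        msg = json.loads(self.link.get())
        return self.local_handlers[msg["tool"]](**msg["args"])
```

The cloud agent sees an ordinary tool result; it never knows the execution happened on the other side of a socket against state that is milliseconds old instead of seconds.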
This pattern also made barge-in straightforward — when Gemini detects the operator interrupting, the TUI clears the audio buffer instantly so the robot's response stops mid-sentence.
Deploying to Google Cloud (and Coming Back)
Getting HALO's cloud service running on Cloud Run was straightforward — a FastAPI app with WebSocket support, containerized and deployed. Firestore stores planner session state so the system can recover after disconnects. Secret Manager keeps the Gemini API key out of the codebase. Terraform wraps all of it so a single command brings up the full stack.
The interesting engineering challenge was what happens when the cloud goes away.
Real robots can't stop when connectivity drops. So I built a Cognitive Switchboard that routes between Gemini 3.1 Flash-Lite (cloud) and Ollama with GPT-OSS 20B + Qwen2.5-VL 3B (local). The switchboard monitors health; after three consecutive cloud failures, it fails over to local. When cloud recovers, it fails back.
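The health-monitoring core of the switchboard is a small state machine. A minimal sketch under my own naming (`CognitiveSwitchboard.report` is illustrative, not the real API): consecutive cloud failures are counted, and a single success resets the count and fails back.

```python
class CognitiveSwitchboard:
    """Route between cloud and local backends based on consecutive-failure counts."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.active = "cloud"

    def report(self, backend, ok):
        if backend != "cloud":
            return self.active
        if ok:
            self.failures = 0
            self.active = "cloud"      # fail back once cloud is healthy again
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = "local"  # fail over after N consecutive failures
        return self.active
```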
Split-Brain and How to Prevent It
The naive version had a serious problem: a slow Gemini response would arrive after failover and get executed by the command router alongside the local backend's commands. Two brains, one robot, and no reliable source of authority.
The fix was a LeaseManager that stamps every command with an epoch and token. On failover, the epoch increments and the old lease is revoked. The command router rejects any command from a stale epoch — even if it's a perfectly valid Gemini response that just arrived late. During switchover, the new backend is pre-warmed before the old lease is revoked, so there's no gap where neither backend is active.
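The epoch-and-token idea fits in a few lines. This is a sketch of the concept under assumed names, not the actual `LeaseManager`: every command carries the lease it was issued under, and the router drops anything stamped with a revoked lease.

```python
import secrets

class LeaseManager:
    """Stamp commands with (epoch, token); reject anything from a stale lease."""
    def __init__(self):
        self.epoch = 0
        self.token = secrets.token_hex(8)

    def failover(self):
        # Pre-warm the new backend first, then revoke the old lease by
        # bumping the epoch and rotating the token.
        self.epoch += 1
        self.token = secrets.token_hex(8)
        return self.epoch, self.token

    def stamp(self, command):
        return {"epoch": self.epoch, "token": self.token, "command": command}

    def accept(self, stamped):
        # A slow Gemini response arriving after failover carries the old
        # epoch and is rejected, even though the command itself is valid.
        return stamped["epoch"] == self.epoch and stamped["token"] == self.token
```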
Context Continuity Across Switches
Preventing split-brain was only half the problem. The other half was making the transition seamless.
The ContextStore maintains an append-only journal of decisions and events. On failover, the new backend gets a compacted summary so it can resume mid-task instead of starting cold. Both backends track a cursor so failback doesn't replay already-processed entries.
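The cursor bookkeeping is the part that prevents replays, and it is easy to sketch. A minimal illustration with assumed method names (`append`, `new_entries`), not the real `ContextStore` interface:

```python
class ContextStore:
    """Append-only journal with a per-backend cursor so failback never replays entries."""
    def __init__(self):
        self.journal = []
        self.cursors = {}   # backend name -> index of the next unread entry

    def append(self, entry):
        self.journal.append(entry)

    def new_entries(self, backend):
        # Return only what this backend hasn't seen, then advance its cursor.
        start = self.cursors.get(backend, 0)
        entries = self.journal[start:]
        self.cursors[backend] = len(self.journal)
        return entries
```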
Session Compaction: When Context Windows Fill Up
Long-running Gemini sessions hit another problem: the context window fills up. The Live Agent accumulates audio transcripts, tool calls, tool results, and conversation history. In longer sessions, context growth started to hurt responsiveness and overall session quality.
I built a compaction plugin using ADK's BasePlugin and EventCompaction primitives. After each agent turn, it checks if the session has grown past a threshold. If so, it finds a compaction boundary, summarizes everything older via LLM, and keeps a configurable overlap window of recent turns. The result is a compact summary that stays within budget without losing the task-level context the agent needs to continue.
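The threshold-boundary-overlap logic reduces to a short function. This is a simplified sketch, not the ADK plugin itself: `events` stands in for the session history, and `summarize` stands in for the LLM call.

```python
def compact(events, budget, overlap, summarize):
    """Once history exceeds `budget` entries, summarize everything older than
    the last `overlap` entries and keep the summary plus the recent window."""
    if len(events) <= budget:
        return events
    boundary = len(events) - overlap           # the compaction boundary
    summary = summarize(events[:boundary])     # LLM summarization in the real system
    return [summary] + events[boundary:]
```

The overlap window is what keeps the agent's most recent turns verbatim, so compaction never garbles the exchange currently in flight.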
The same compaction feeds into the cross-backend sync: when Gemini compacts a session, the summary propagates to the local backend's history through the ContextStore, so failover doesn't lose the conversation.
Skills as Diagrams, Not Code
I wanted adding a new robot skill to be easy — define the state machine, write handlers, done. So I made skills Mermaid diagrams. Each skill (pick, place, track) is authored as a stateDiagram-v2 file. The FSM engine parses Mermaid into an immutable graph at startup and executes it with pluggable per-state handlers.
This made failure recovery explicit. The pick skill has recovery states (RECOVER_RETRY_APPROACH, RECOVER_REGRASP, RECOVER_ABORT) right in the diagram with guarded transitions. The VERIFY_GRASP phase checks if the object's Z position changed during lift — a deterministic check that doesn't involve the LLM. Only when the FSM exhausts its local recovery options does it escalate to the planner.
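To make the idea concrete, here is a trimmed, illustrative `stateDiagram-v2` fragment for the pick skill and a toy parser. Both are sketches: the real engine handles guards, labels, and handler binding, and this regex only extracts bare `A --> B` transitions.

```python
import re

# Illustrative fragment of a pick-skill diagram (guard labels omitted).
MERMAID = """
stateDiagram-v2
    [*] --> APPROACH
    APPROACH --> GRASP
    GRASP --> VERIFY_GRASP
    VERIFY_GRASP --> RECOVER_REGRASP
    RECOVER_REGRASP --> GRASP
    VERIFY_GRASP --> [*]
"""

def parse_transitions(text):
    """Parse `A --> B` lines from a stateDiagram-v2 body into an adjacency dict."""
    graph = {}
    for src, dst in re.findall(r"(\S+)\s*-->\s*(\S+)", text):
        graph.setdefault(src, []).append(dst)
    return graph
```

Because the diagram is the source of truth, a failed `VERIFY_GRASP` transitioning to `RECOVER_REGRASP` is visible in the same file a reviewer reads, not buried in handler code.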
Generating Training Data in Simulation
The MuJoCo simulation with an SO-101 arm model is where everything comes together. I also wanted to generate demonstration episodes for future imitation learning, and that turned out to be harder than the runtime itself.
Early trajectories collided with the table, had jerky motion, or produced unreliable grasps. Over several iterations I built a pipeline that scores grasp candidates, generates keyframes, solves inverse kinematics with fallbacks, produces jerk-limited trajectory profiles, and validates every segment for clearance. The result is smooth, physically plausible demonstrations across randomized object placements.
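One standard way to get smooth, jerk-limited segments — and an assumption on my part about how such a profile might look, not HALO's actual planner — is the classic minimum-jerk polynomial, which has zero velocity and acceleration at both endpoints:

```python
def min_jerk(p0, p1, t, T):
    """Minimum-jerk position between p0 and p1 over duration T.

    The blend s(tau) = 10*tau^3 - 15*tau^4 + 6*tau^5 starts and ends with
    zero velocity and zero acceleration, so consecutive segments join smoothly.
    """
    tau = max(0.0, min(1.0, t / T))  # clamp to [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return p0 + (p1 - p0) * s
```

Evaluating this per joint between IK keyframes yields trajectories without the velocity discontinuities that made the early demonstrations jerky.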
Three Takeaways from Building a Real-Time Agent
If I had to distill a few weeks of HALO into advice:
Architecture matters more than model size. A fast, small planner that sees the right abstraction is better than a giant model drowning in raw telemetry. The planner never sees numeric target vectors or joint positions — it sees something like "distance: 0.12 m, confidence: 0.87, phase: EXECUTE_APPROACH" — and that's enough.
Design for failure from day one. Not just model failures — network failures, VLM timeouts, grasp failures, context overflow. Every boundary in the system has a fallback path. That's what makes it feel robust instead of fragile.
Keep the LLM out of the hot path. The moment you put inference in a timing-critical loop, you've built a demo, not a system. Let the LLM think asynchronously and let deterministic code handle real-time.
HALO is still early, but the architecture already works end to end — from voice to planning to real-time control — and I'm excited to keep pushing it toward real hardware.
Demo: https://www.youtube.com/watch?v=hIvHln6MW2w
Code: https://github.com/andrei-ace/HALO
Devpost: https://devpost.com/software/halo-kolu5w
Built for the Gemini Live Agent Challenge. This post was written as part of my entry to the hackathon.
#GeminiLiveAgentChallenge #Robotics #AI #GeminiAPI #GoogleCloud #MuJoCo #OpenSource