I discovered this while creating a vision-driven UI agent (code) that runs entirely locally on a MacBook Pro M4 with 48 GB RAM. It uses:
- `gpt-oss:20b` (via Ollama) for deciding the next action,
- `qwen3-vl:30b` (via Ollama) for interpreting screenshots and returning structured UI observations plus bounding boxes,
- a handful of tools: `describe_webpage`, `get_coordinates_for`, `click`, and `write`, all driving real OS-level mouse and keyboard events via `pyautogui`.
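To make the tool layer concrete, here is a minimal sketch of what the `click` and `write` tools might look like on top of `pyautogui`; the bodies and return values are illustrative, not the exact implementation:

```python
import pyautogui

def click(x: int, y: int) -> str:
    """Move the OS cursor to (x, y) and left-click."""
    pyautogui.moveTo(x, y, duration=0.2)   # brief easing so hover states register
    pyautogui.click()
    return f"clicked at ({x}, {y})"

def write(text: str) -> str:
    """Type into whatever element currently has keyboard focus."""
    pyautogui.write(text, interval=0.03)   # per-key delay keeps web inputs from dropping characters
    return f"typed {len(text)} characters"
```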
The setup feels rock-solid in short demos. Longer runs, however, reveal the cracks: action loops, stale assumptions, and growing inconsistency.
This post explains exactly why that happens and the simplest fix that made the agent behave predictably on extended tasks.
## The failure mode: fixation loops and context overruns

Every action changes the UI. The agent needs the *latest* state to decide correctly.
Most frameworks simply append every tool output to a growing conversation history. In my case, the main culprit was the per-step UI observation from `describe_webpage`: a dense, structured report of panels, fields, blockers, headers, and more.
Those reports were deliberately rich because UI automation lives and dies by edge cases - overlapping modals, multi-month calendars with duplicate day numbers, subtle overlays. The detail helped the reasoning model craft precise follow-up requests for visual grounding.
Yet the very richness became toxic when repeated. After a dozen steps the model saw contradictory states (“Calendar open” and “Calendar closed”), duplicate labels, and stale observations that looked authoritative. The result? Fixation loops, wrong clicks on vanished elements, token pressure, higher latency, and occasional hard failures.
## The naive fix that failed: “just make the description shorter”

I first tried trimming the reports. Token count dropped, but performance collapsed: weaker element-localization requests, poorer grounding, and more “element not found” errors.
I faced a clear tension—I needed rich descriptions for accurate grounding, yet I couldn’t keep every historical observation.
## The real fix: deprecate stale UI observations (middleware)

The key insight: for UI state, freshness beats history.
Screenshots and DOM observations go stale almost instantly. So I kept the current report detailed and replaced every older one with a short sentinel:
[DEPRECATED — superseded by a more recent observation]
This preserves the event timeline and debugging clarity while stripping out the misleading, token-heavy content.
## Implementation: DeprecateOldScreenshotsMiddleware

A simple pre-model hook (LangChain-style middleware) rewrites the message list right before the LLM call:
```python
from langchain.agents.middleware import AgentMiddleware, ModelRequest, ModelResponse
from langchain_core.messages import ToolMessage


class DeprecateOldScreenshotsMiddleware(AgentMiddleware):
    @staticmethod
    def _deprecate(messages: list) -> list:
        # Find every describe_webpage tool result in the history.
        idx = [
            i for i, m in enumerate(messages)
            if isinstance(m, ToolMessage) and getattr(m, "name", None) == "describe_webpage"
        ]
        # Replace all but the most recent one with a short sentinel.
        for i in idx[:-1]:
            old = messages[i]
            messages[i] = ToolMessage(
                content="[DEPRECATED — superseded by a more recent screenshot]",
                tool_call_id=old.tool_call_id,
                name=old.name,
            )
        return messages

    def wrap_model_call(self, request: ModelRequest, handler) -> ModelResponse:
        # Rewrite the message list just before it is sent to the model.
        msgs = self._deprecate(list(request.messages))
        return handler(request.override(messages=msgs))

    # async version omitted for brevity
```

Attach it when creating the agent and the problem largely disappears.
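For completeness, a sketch of the attachment step, assuming LangChain’s `create_agent` entry point with a `middleware` argument; `reasoning_model` and the tool objects are placeholders for however you wire up your own models and tools:

```python
from langchain.agents import create_agent

agent = create_agent(
    model=reasoning_model,   # e.g. the gpt-oss:20b chat model served by Ollama
    tools=[describe_webpage, get_coordinates_for, click, write],
    middleware=[DeprecateOldScreenshotsMiddleware()],
)
```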
## Why this works so well

- Removes stale “truth” – Old reports look factual, so the model stops arguing with itself.
- Keeps rich current observations – The latest screen-state report stays detailed, exactly what grounding needs.
- Preserves chronology without bloat – You keep the sequence of events but collapse the payload.
The broader recipe: combine domain-specific pruning (like screenshot deprecation) with general trimming (keep the last N tokens) and, where needed, summaries or retrieval for long-lived facts. This combination scales reliably.
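As a rough illustration of the general-trimming piece, here is a naive sketch that keeps the leading system message plus the newest messages under a budget; character counting stands in for proper token counting, and the budget is arbitrary:

```python
def trim_to_budget(messages: list, max_chars: int = 24_000) -> list:
    """Keep the (assumed) leading system message plus the newest messages that fit a rough budget."""
    system, rest = messages[:1], messages[1:]
    kept, used = [], 0
    for m in reversed(rest):                      # walk backwards so the newest messages win
        size = len(str(getattr(m, "content", m)))
        if used + size > max_chars and kept:
            break
        kept.append(m)
        used += size
    return system + list(reversed(kept))
```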
## A simple checklist for UI/web agents

If your agent starts looping or acting strangely, check these first:
- Are multiple past UI states still in context and able to contradict each other?
- Do tool outputs (especially screenshot reports or DOM dumps) dominate the token budget?
- Do you explicitly prioritize the latest screen-state report?
- When clicks fail, do you re-observe and re-ground instead of retrying stale coordinates?
- Does the current description contain enough anchors for calendars, repeated labels, multi-panel layouts, and overlays?
- Are blockers, modals, scroll position, and focus checked before every action?
- Do you detect repeated (tool, args, UI-hash) patterns and force recovery?
- Do you maintain a compact task state (e.g., “destination set”, “dates confirmed”) separate from raw tool outputs? (A minimal sketch follows this checklist.)
- After domain pruning, do you still enforce a hard context-window cap?
- Are full tool outputs logged to disk for debugging instead of living forever in the prompt?
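To make the compact-task-state point concrete, here is a minimal sketch; the fields are made up for a travel-booking-style task and should mirror whatever durable facts your own task produces:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """Durable facts the agent has established, kept out of the raw tool-output stream."""
    destination_set: bool = False
    dates_confirmed: bool = False
    notes: list[str] = field(default_factory=list)   # short, human-readable decisions

    def render(self) -> str:
        # Injected into the prompt as a couple of lines instead of replaying old observations.
        done = [label for label, flag in [("destination set", self.destination_set),
                                          ("dates confirmed", self.dates_confirmed)] if flag]
        return "Progress: " + (", ".join(done) or "nothing confirmed yet") + ". " + " ".join(self.notes)
```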
The difference between an agent that feels competent and one that spirals is rarely a better model. It’s treating context as engineered state.
If your agent observes a changing UI, don’t archive every observation. Keep the *shape* of history—but only the latest truth.
## Optional next-level improvements

- Hash screenshots; reuse the previous observation if unchanged (instant loop breaker).
- Store full observation and grounding outputs to disk/logs, never in the model context.
- Add a loop detector based on (action, parameters, UI-hash); a sketch combining this with screenshot hashing follows below.
- Summarize long-lived decisions into a compact state object.
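A minimal sketch of the hashing and loop-detection ideas together; `screenshot_bytes` comes from wherever you capture the screen, and the window/threshold values are arbitrary:

```python
import hashlib
from collections import deque

def ui_hash(screenshot_bytes: bytes) -> str:
    """Cheap fingerprint of the current screen; an unchanged hash means the UI did not change."""
    return hashlib.sha256(screenshot_bytes).hexdigest()[:16]

class LoopDetector:
    """Flags when the same (tool, args, UI-hash) triple keeps recurring in a short window."""
    def __init__(self, window: int = 6, threshold: int = 3):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def seen_loop(self, tool: str, args: str, screen_hash: str) -> bool:
        key = (tool, args, screen_hash)
        self.recent.append(key)
        # True means: force recovery (re-observe, re-ground, or ask for a new plan).
        return self.recent.count(key) >= self.threshold
```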


