A dedicated edge device that collects logs from inference nodes, diagnoses hardware failures using a RAG-powered field manual, and sends structured alerts over satellite — running a 4B hybrid Mamba2/transformer model locally.
## The Deployment Problem

Imagine a dozen Jetson Orin Nanos deployed on an offshore platform, running YOLOv8 inference on camera feeds 24/7 to monitor equipment, safety perimeters, and flare stacks. The nearest technician is a helicopter ride away, and satellite uplink costs dollars per megabyte.
These devices fail in predictable ways. A Jetson running three camera pipelines will eventually overheat, throttle its GPU clocks, miss inference deadlines, and shed load. The logs tell the full story — but correlating events across three different log formats and deciding what to do about it requires either a human on-site or an expensive satellite link to a cloud LLM.
The deployment model is simple: inference nodes only run the camera pipelines, while a separate monitoring Jetson collects their logs and runs the agent. It reads the logs, looks up problems in a field manual, and sends a structured email over satellite — a diagnosis, root cause, and specific fix. Not megabytes of raw telemetry.
## Why Nemotron-3 Nano Made This Practical

The monitoring Jetson has 8 GB of unified memory. The agent stack needs an LLM, an embedding model (BAAI/bge-small-en-v1.5, ~50 MB), a cross-encoder re-ranker (Xenova/ms-marco-MiniLM-L-6-v2, ~80 MB), a FAISS index, and a Python runtime. The LLM gets the full GPU since this is a dedicated monitoring node — but 8 GB is still a hard ceiling.
Log excerpts, field manual sections, tool metadata, system prompts, and conversation history all accumulate quickly. In practice, investigations used 4K–8K tokens of context. I configured 16K to leave headroom for longer conversations.
Nemotron-3 Nano 4B is a hybrid architecture with 42 layers, but only 4 of them are transformer attention layers (at positions 12, 17, 24, and 32). The remaining 38 are Mamba2 selective state space layers. The Mamba2 layers maintain a fixed-size state regardless of sequence length, so their memory use does not grow with context.
Measured memory allocation from llama.cpp:

```
KV cache (4 attention layers @ 16K) :  256 MiB (K: 128 MiB, V: 128 MiB)
Mamba2 recurrent state (38 layers)  :  324 MiB (fixed; does not grow with context)
Compute buffers                     : ~318 MiB (GPU + host)
```

A pure transformer with 42 attention layers at the same dimensions would need roughly 2.6 GB of KV cache alone at a 16K context window. That would leave no room for the model weights, let alone an embedding model and re-ranker.
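The 2.6 GB figure is simple arithmetic on the measured numbers, under the rough assumption that every attention layer would cost the same per-layer KV cache at 16K context:

```python
# Back-of-envelope KV-cache scaling from the measured llama.cpp numbers:
# 4 attention layers at 16K context cost 256 MiB, i.e. 64 MiB per layer.
per_layer_mib = 256 / 4

hybrid_mib = 4 * per_layer_mib             # Nemotron-3 Nano: only 4 attention layers
pure_transformer_mib = 42 * per_layer_mib  # if all 42 layers were attention

print(f"hybrid:           {hybrid_mib:.0f} MiB")                 # 256 MiB
print(f"pure transformer: {pure_transformer_mib / 1024:.1f} GiB")  # ~2.6 GiB
```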
In Q4_K_M quantization, the model file is about 2.8 GB on disk. In practice, that was enough to run a 16K context window together with the embedding model, re-ranker, FAISS index, and Python runtime within the Orin Nano's 8 GB memory budget. I initially tested several other 4B-class models, but they either exhausted memory at longer context lengths or left too little headroom for retrieval and re-ranking. For this class of operational task — following procedures, running shell commands, formatting structured reports — procedural reliability matters more than open-ended reasoning ability.
## What I Built

Three LangGraph ReAct agents connected to Nemotron-3 Nano via llama.cpp:
- Main agent follows an investigation procedure loaded from the field manual. It delegates work to sub-agents, then decides whether to email ops or reboot.
- Log search sub-agent runs shell commands (grep, awk, date) against three log files with different timestamp formats. Computes time cutoffs, searches each file, deduplicates repeated errors.
- Manual consultant sub-agent searches a FAISS-indexed knowledge base with a cross-encoder re-ranker. Returns severity and remediation steps.
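The deduplication step in the log-search sub-agent is conceptually simple: strip the timestamp, count repeats, report each distinct message once. A minimal sketch (the normalization rule here is illustrative, not the repo's actual one):

```python
import re
from collections import Counter

def dedup_log_lines(lines: list[str]) -> list[str]:
    """Collapse repeated errors by stripping timestamps and counting occurrences."""
    # Illustrative normalization: drop a leading ISO-8601 timestamp, if present.
    stripped = [re.sub(r"^\S+T\S+\s+", "", ln) for ln in lines]
    counts = Counter(stripped)  # preserves first-seen order
    return [f"{msg}  (x{n})" for msg, n in counts.items()]

lines = [
    "2024-06-01T12:00:01 ERROR pipeline stall camera-03",
    "2024-06-01T12:00:02 ERROR pipeline stall camera-03",
    "2024-06-01T12:00:03 WARN deadline miss",
]
for entry in dedup_log_lines(lines):
    print(entry)
# ERROR pipeline stall camera-03  (x2)
# WARN deadline miss  (x1)
```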
The agent framework is standard. The challenge was getting the full stack — LLM, RAG retrieval, re-ranking, and multi-step tool use — to run on an 8 GB edge device with enough context for real investigations.
## The Knowledge Base

All operational knowledge lives in Markdown documents indexed for RAG:
- Field manual — 14 incident sections (thermal throttle, pipeline stall, OOM kill, NVMe errors, etc.), each with severity, log signatures, root causes, and specific remediation commands.
- Hardware spec — Orin Nano thermal envelope, power rails, memory budget, NVPMODEL profiles.
- Deployment guide — systemd services, pipeline configuration, monitoring.
- Known issues — 6 documented platform quirks with workarounds.
Documents are chunked by `##` headings, embedded with fastembed, and indexed with FAISS. At query time, a cross-encoder re-ranks the top candidates. The investigation procedure is loaded from the manual into the main agent's system prompt — updating the manual changes how the agent investigates, no code changes needed.
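The chunking step fits in a few lines. This is an illustrative version, not the repo's exact code, and the sample manual text is invented; the embedding and indexing steps are noted in comments because they require model downloads:

```python
import re

def chunk_by_heading(markdown: str) -> list[str]:
    """Split a manual document on '##' headings, keeping each heading with its body."""
    parts = re.split(r"(?m)^(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]

manual = """## Thermal Throttle
Severity: Critical
Signature: throttle events in thermal.log, clock drops in tegrastats output.

## OOM Kill
Severity: High
Signature: "Out of memory: Killed process" in dmesg.
"""

chunks = chunk_by_heading(manual)
print(len(chunks))                # 2 chunks, one per incident section
print(chunks[1].splitlines()[0])  # ## OOM Kill

# The real pipeline then embeds each chunk (fastembed, BAAI/bge-small-en-v1.5),
# adds the vectors to a FAISS index, and at query time re-scores the top hits
# with the cross-encoder before returning them to the sub-agent.
```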
## The Sandbox

Since the agent runs shell commands against log files, sandboxing was a first-class requirement, not an afterthought. The agent runs inside a bubblewrap (bwrap) sandbox with:
- Read-only filesystem (system libs, agent code, log files, knowledge base)
- PID, IPC, UTS namespace isolation
- Full network isolation (`--unshare-net`) with a socat bridge that forwards only localhost:8080 to the llama-server
- Only writable mount: `output/` for the action log
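A launch command in that spirit might look like the following. The paths and the socat pattern here are illustrative, not the repo's exact invocation:

```shell
# Illustrative bwrap launch: read-only mounts, namespace isolation, one writable dir.
bwrap \
  --ro-bind /usr /usr --ro-bind /etc /etc \
  --ro-bind ./agent /app --ro-bind ./logs /logs --ro-bind ./kb /kb \
  --bind ./output /output \
  --unshare-pid --unshare-ipc --unshare-uts --unshare-net \
  --dev /dev --proc /proc \
  python3 /app/main.py

# With --unshare-net the sandbox has no network at all. One common bridging
# pattern: bind a unix socket into the sandbox and have socat on the host side
# forward it to llama-server, so only that one endpoint is reachable, e.g.:
#   socat UNIX-LISTEN:output/llm.sock,fork TCP:127.0.0.1:8080
```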
The synthetic log generator creates 24 hours of realistic Jetson logs with two incidents in the last hour:
- Memory spike (~45 min ago) — GPU memory hits 90%, CUDA unified memory thrashing causes a 340ms latency spike, self-resolves after TensorRT memory pool compaction.
- Thermal throttle cascade (~30 min ago) — a third camera pipeline starts, temperature climbs from 45°C to 73°C in 10 seconds, clocks drop from 918 MHz to 420 MHz, deadline misses cascade, pipeline stalls, camera-03 gets suspended, recovers after 60 seconds.
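The generator's job is mostly interpolation: pick an incident window, ramp the relevant metric across it, and emit one plausibly formatted line per tick. A toy sketch of the thermal ramp (the line format is invented, not the generator's actual output):

```python
from datetime import datetime, timedelta

def thermal_ramp_lines(start: datetime, t0: float = 45.0, t1: float = 73.0, secs: int = 10):
    """Emit one log line per second, ramping temperature linearly from t0 to t1."""
    for i in range(secs + 1):
        temp = t0 + (t1 - t0) * i / secs
        ts = (start + timedelta(seconds=i)).isoformat(timespec="seconds")
        yield f"{ts} THERMAL soc_temp={temp:.1f}C"

lines = list(thermal_ramp_lines(datetime(2024, 6, 1, 12, 0, 0)))
print(lines[0])   # 2024-06-01T12:00:00 THERMAL soc_temp=45.0C
print(lines[-1])  # 2024-06-01T12:00:10 THERMAL soc_temp=73.0C
```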
When you ask `check last hour`, the agent:
- Searches all three log files with time filtering — ISO cutoff for app.log, syslog time comparison for thermal.log, seconds-since-boot arithmetic for dmesg.log
- Groups related errors — thermal throttle + deep throttle + deadline misses + pipeline stall = one incident
- Consults the field manual — retrieves the Thermal Throttle section (Critical severity) with specific remediation steps
- Sends a structured email with the root cause chain, supporting evidence from all three log files, and recommended actions from the manual
- Checks the reboot policy — not needed, the thermal cascade self-recovered
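The ISO-cutoff case is the simplest of the three filters: ISO-8601 timestamps sort lexicographically, so the time comparison is a plain string comparison in awk. A sketch, with an invented sample log (GNU `date` assumed):

```shell
# Tiny sample log in app.log's ISO-8601 style (illustrative entries).
cat > /tmp/app.log <<'EOF'
2024-01-01T00:00:00 INFO old entry, outside the window
2099-01-01T00:00:00 ERROR recent entry, inside the window
EOF

# GNU date computes the cutoff; ISO timestamps compare correctly as strings.
CUTOFF=$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%S')
awk -v c="$CUTOFF" '$1 >= c' /tmp/app.log
```

The dmesg case needs one extra step: read the current uptime from `/proc/uptime`, subtract 3600, and compare the bracketed seconds-since-boot field numerically instead.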
Measured performance on Jetson Orin Nano (from llama.cpp server logs):

```
Model                     : Nemotron-3 Nano 4B Q4_K_M
Context window            : 16,384 tokens
Generation speed          : ~15.5 tokens/sec (consistent across calls)
Prompt eval speed         : 135-400 tokens/sec (varies with cache hits)
Log search shell calls    : 5-8 seconds each
Log search final summary  : ~2.5 minutes (2,423 tokens generated)
Manual lookup             : ~16 seconds
Main agent final response : ~48 seconds (722 tokens generated)
Full investigation time   : ~6 minutes end to end
```

The bottleneck is the log-search sub-agent's summarization step, which compresses thousands of raw log lines into a deduplicated report for the main agent. That is slow, but it keeps the rest of the system simple and makes the final diagnosis traceable. For automated monitoring that runs periodically, a few minutes per investigation is acceptable — especially compared with escalating raw logs to a remote operator.
## What I Learned

The Mamba2 hybrid design was the enabling factor. I tried several 4B models before landing on Nemotron-3 Nano. Pure transformers at this size either ran out of memory at 16K context or left insufficient headroom for the RAG pipeline. The reduced KV cache pressure from the Mamba2 layers is what made the full stack fit.
Sub-agents keep the context manageable. A single agent processing raw log output would exhaust the context window. The log search sub-agent sees raw lines, processes them, and returns a compact summary. The main agent only sees the deduplicated result. Each sub-agent's context stays small.
The field manual is the right abstraction. Early versions had behavioral instructions scattered across prompts and tool descriptions. Moving everything to indexed Markdown documents means ops teams can update the runbook — add new error types, change remediation steps — without touching code.
## Try It

The repository includes the synthetic log generator, field manual, sandbox setup, and agent implementation.
Requirements: Jetson Orin Nano (8 GB), NVMe SSD, bubblewrap, socat, llama.cpp. Setup is one command (`make setup`).
The point of this project is not that a 4B model can replace an SRE. It is that, on the edge, a small local model can already do something operationally useful: turn noisy hardware logs into a diagnosis and an action plan without depending on a cloud connection. On Jetson Orin Nano, Nemotron-3 Nano was the first model I tested that made the whole stack fit cleanly enough to be genuinely usable.