# Running a 122B MoE LLM with Autonomous Tool-Calling on Two NVIDIA Jetson Thor T5000 Boards
---
## Introduction
What if your AI assistant ran entirely on hardware you owned — no cloud, no API keys, no data leaving your premises? And what if that assistant could autonomously read files, run bash commands, and write code, all driven by a 122-billion parameter model sitting on a single board?
That's what we built with BrainiaK on the NVIDIA Jetson Thor T5000. This article documents exactly how we did it: the model, the stack, the pitfalls, and the architecture decisions that made it work.
We're running Qwen3.5-122B-A10B-AWQ-4bit — a Mixture-of-Experts model with 122B total parameters and 10B active per forward pass — on a single Jetson Thor T5000 with 128 GB of unified memory. On top of that, we've built an autonomous dev agent with real tool-calling (filesystem, shell, search) that runs fully on-device.
No cloud. No external APIs. Fully sovereign.
---
## §1 — Why the Jetson Thor T5000 Changes Everything
The Jetson Thor T5000 is a different kind of edge device. Most edge hardware forces a brutal trade-off: either you run a small model (7B-13B) that fits in limited VRAM, or you offload to the cloud and lose sovereignty. The Thor T5000 breaks that trade-off.
**The key spec: 128 GB of unified memory.**
On a conventional server with discrete GPUs, a 122B AWQ-4bit model (75 GB weights) would require at least one H100 80 GB — and you'd still be fighting CPU↔GPU transfer overhead for the KV cache. On the Jetson Thor, the CPU and GPU share the same physical memory pool. The model weights, KV cache, and all system processes live in one contiguous 128 GB space. No PCIe bottleneck. No transfer latency.
The practical consequence: you can run models that would be impossible on conventional edge hardware, while keeping the power envelope of an edge device (~700W peak vs 10kW+ for a DGX node).
**Our production configuration:**
| Component | Spec |
|-----------|------|
| Platform | NVIDIA Jetson Thor T5000 |
| Memory | 128 GB unified LPDDR5X |
| OS | JetPack 7.1 (Ubuntu 24.04) |
| CUDA | 13.0 |
| Driver | 580 |
| Model | Qwen3.5-122B-A10B-AWQ-4bit |
| Model GPU footprint | ~102 GB (74.67 GB weights + 27.22 GB KV pool @ 0.90 util) |
| Context window | **131 072 tokens** (validated — see §6) |
| Inference speed | ~4.2 tok/s (MTP×3: -30% to -47% latency — see §5) |
| Cold start (model load) | ~10 min (weights 37s + CUDA graphs 9 min with MTP×3) |
With ~102 GB used by the model, you still have ~26 GB for the OS, databases (Postgres, NATS), and application services. That's a real production deployment, not a demo.
---
## §2 — Model Deployment: The Pitfalls Nobody Warns You About
Getting Qwen3.5-122B running on Jetson Thor is not straightforward. We hit three major pitfalls. Documenting them here will save you hours.
### Pitfall 1: The AWQ trap
Qwen3.5-122B-A10B-AWQ-4bit uses a quantization format called **compressed-tensors**, not classic AWQ. If you pass `--quantization awq` to vLLM, it will crash with a cryptic error.
**Never do this:**
```bash
vllm serve /path/to/model --quantization awq # WRONG — will crash
```
**Do this instead — let vLLM auto-detect:**
```bash
vllm serve /path/to/model # compressed-tensors auto-detected
```
vLLM will correctly identify `CompressedTensorsWNA16MarlinMoEMethod` and use the Marlin WNA16 4-bit kernel. You'll see this in the logs:
```
Using CompressedTensorsWNA16MarlinMoEMethod for layer...
```
### Pitfall 2: The page cache trap
After stopping a vLLM container that loaded a 75 GB model, the Jetson's unified memory doesn't immediately free that space. The Linux page cache holds the model weights in RAM even after the process exits. On our first restart attempt, we had only 15 GB free out of 128 GB — the previous model run had left 37+ GB in page cache.
**Always run this before starting a new large model:**
```bash
docker stop brainiak-agent-alpha
docker rm brainiak-agent-alpha
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# Wait a few seconds, then verify:
free -h # should show ~119 GB available
```
This is unique to Jetson's unified memory architecture. On discrete GPU systems, GPU memory is freed immediately when the process exits. On Jetson, it behaves like any Linux process using large amounts of RAM.
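To make restarts repeatable, the drop-caches step can be wrapped in a small helper that blocks until the pool has room again. A sketch (the 105 GiB threshold assumes our ~102 GB model footprint; `wait_for_mem` is our own naming, not a system tool):

```shell
# Block until /proc/meminfo reports at least $1 GiB available.
wait_for_mem() {
    while :; do
        avail=$(awk '/MemAvailable/ {printf "%d", $2/1048576}' /proc/meminfo)
        [ "$avail" -ge "$1" ] && { echo "${avail} GiB available"; return 0; }
        sleep 2
    done
}

# Typical restart sequence:
#   docker stop brainiak-agent-alpha && docker rm brainiak-agent-alpha
#   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
#   wait_for_mem 105 && docker compose up -d vllm-agent-alpha
```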
### Pitfall 3: The symlink pattern for model management
When you want to switch models (e.g., from Qwen3.5-122B to Qwen3.5-235B when it arrives), rebuilding the container is wasteful. We use a symlink pattern:
```bash
# Create symlink pointing to the active model (host-side HF cache)
ln -sfn models--cyankiwi--Qwen3.5-122B-A10B-AWQ-4bit/snapshots/<HASH> \
  /home/user/.cache/huggingface/agent-alpha-active
# The HF cache is mounted at /data/models/huggingface inside the container,
# so vLLM serves through the symlink:
vllm serve /data/models/huggingface/agent-alpha-active
```
To switch models: update the symlink, drop caches, restart the container. No Docker rebuild needed.
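The switch can be reduced to a one-line helper (function name and path convention are ours, not a standard tool):

```shell
# switch_model <snapshot-dir>: repoint the active-model symlink atomically.
# ln -sfn replaces the link in place; vLLM picks up the new target on restart.
switch_model() {
    ln -sfn "$1" "${HF_HOME:-$HOME/.cache/huggingface}/agent-alpha-active"
    readlink "${HF_HOME:-$HOME/.cache/huggingface}/agent-alpha-active"
}
```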
### Our docker-compose configuration
```yaml
services:
  vllm-agent-alpha:
    image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
    runtime: nvidia
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    environment:
      NVIDIA_VISIBLE_DEVICES: all
    volumes:
      - ${HF_CACHE:-~/.cache/huggingface}:/data/models/huggingface
    command: >
      vllm serve /data/models/huggingface/agent-alpha-active
      --max-model-len 131072
      --gpu-memory-utilization 0.90
      --host 0.0.0.0
      --port 8000
      --served-model-name qwen-a qwen-b qwen-copilot qwen-thinking
      --trust-remote-code
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --speculative-config '{"method":"qwen3_5_mtp", "num_speculative_tokens":3}'
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:8000/health || exit 1"]
      interval: 15s
      start_period: 900s
```
Several important flags here that we'll cover in §3, §4, §5, and §6.
---
## §3 — Four Roles, One Model: The served-model-name Architecture
Look at the `--served-model-name` flag in our config:
```
--served-model-name qwen-a qwen-b qwen-copilot qwen-thinking
```
One physical model. Four logical endpoints. This is the core architectural decision that makes BrainiaK work.
### What it does
vLLM exposes a single OpenAI-compatible API but responds to four different model names. Any client that sends `model: "qwen-copilot"` or `model: "qwen-thinking"` reaches the same underlying Qwen3.5-122B weights. The model selection is effectively a routing tag, not a different model load.
This maps directly to our four pipeline phases:
| Alias | Role | Use |
|-------|------|-----|
| `qwen-a` | Intake | Request classification, execution class, prompt machine selection |
| `qwen-copilot` | Copilot | Tool-calling loop, code execution, iterative problem solving |
| `qwen-thinking` | Thinking | Deep reasoning for complex problems (temp=0.6) |
| `qwen-b` | Recompose | Final response synthesis from all prior context |
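Client-side, selecting a phase is just a matter of setting the `model` field on the shared OpenAI-compatible endpoint. A minimal illustration (the endpoint URL and `build_request` helper are ours):

```python
import json

# One endpoint, four logical models: the alias is a routing tag, not a
# different model load.
ENDPOINT = "http://jetson1:8001/v1/chat/completions"
PHASE_ALIAS = {
    "intake": "qwen-a",
    "copilot": "qwen-copilot",
    "thinking": "qwen-thinking",
    "recompose": "qwen-b",
}

def build_request(phase: str, messages: list) -> str:
    """JSON body for the shared endpoint; only the alias changes per phase."""
    return json.dumps({"model": PHASE_ALIAS[phase], "messages": messages})
```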
### Why this matters architecturally
Today these four aliases point to one model — temporal multiplexing with preserved KV cache between phases. Tomorrow, if we need higher throughput or want to specialize each role, we can switch each alias to a different model instance by changing four environment variables:
```bash
BRAINIAK_VLLM_URL_INTAKE=http://jetson1:8001
BRAINIAK_VLLM_URL_COPILOT=http://jetson1:8002
BRAINIAK_VLLM_URL_THINKING=http://jetson1:8003
BRAINIAK_VLLM_URL_RECOMPOSE=http://jetson1:8004
```
Zero code changes. The architecture absorbs both configurations — 1 large model and N specialized models — without modification.
### The pipeline flow
A request to BrainiaK traverses all four phases:
```
User request
│
▼
[Admission Controller] — classify: fast / normal / heavy
│
▼
[Intake — qwen-a] — route to correct Prompt Machine, extract goal
│
▼
[Copilot — qwen-copilot] — tool-calling loop (if tools needed)
│ └─ execute_python / bash / read_file /...
▼
[Thinking — qwen-thinking] — deep reasoning (heavy requests only)
│
▼
[Recompose — qwen-b] — synthesize final response
│
▼
[Memory] — persist interaction to Postgres
```
The KV cache advantage: since all four aliases are the same physical model, the context from the intake phase is already in the KV cache when the copilot phase starts. The model doesn't re-read the original request — it continues from where it left off.
---
## §4 — Autonomous Tool-Calling: The qwen3_coder Discovery
This is the section that cost us the most time to figure out, and will save you the most.
### Enabling function calling on Jetson
vLLM's OpenAI-compatible API supports function/tool calling, but you need two specific flags:
```
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
```
The first flag enables automatic tool choice (the model decides when to call a tool). The second is the critical one.
### Why qwen3_coder and not hermes
Most vLLM tool-calling tutorials use `--tool-call-parser hermes`. Hermes expects tool calls in standard JSON format:
```json
{"name": "function_name", "arguments": {"key": "value"}}
```
Qwen3.5 doesn't do that. It uses a completely different format:
```
<tool_call>
<function=list_dir><parameter=path>/app</parameter></function>
</tool_call>
```
If you use `--tool-call-parser hermes` with Qwen3.5, the model's tool calls appear as raw text in the response `content` field — no structured `tool_calls` array. The client receives what looks like a normal text response with XML-like tags buried in it. Extremely confusing to debug.
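A quick way to spot this misconfiguration from client code is to scan `content` for leaked Qwen-style tags. A diagnostic sketch only (our own regex, not a substitute for the correct parser):

```python
import re

# Matches the Qwen3.5 tool-call markup shown above when it leaks into the
# response content instead of being parsed into a tool_calls array.
TOOL_TAG = re.compile(
    r"<tool_call>\s*<function=(\w+)>(.*?)</function>\s*</tool_call>", re.S
)

def leaked_tool_calls(content: str) -> list:
    """Return function names found as raw text — a sign the parser is wrong."""
    return [m.group(1) for m in TOOL_TAG.finditer(content)]
```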
We discovered the correct parser by listing available parsers directly from the vLLM source:
```python
from vllm.entrypoints.openai.tool_parsers import ToolParserManager
print(list(ToolParserManager.lazy_parsers.keys()))
# ['granite-20b-fc', 'hermes', 'internlm', 'jamba', 'llama3_json',
# 'mistral', 'pythonic', 'toolace', 'xlam', 'qwen3_coder',...]
```
`qwen3_coder` — exactly what Qwen3.5 needs.
### The tool-calling loop in action
With the correct parser, the model correctly populates `tool_calls` in the response. Our BrainiaK dev agent implements a simple iterative loop:
```python
async def _run_tool_loop(self, messages, scope="dev", max_turns=10):
    tools = self._registry.get_definitions(scope)
    conversation = list(messages)
    for turn in range(max_turns):
        result = await client.chat_with_tools(
            model="qwen-copilot",
            messages=conversation,
            tools=tools,
        )
        if not result.tool_calls:
            return result.content  # Direct answer, done
        # Echo the assistant turn (with its tool_calls) into the history —
        # OpenAI-style APIs require it before the tool results.
        # (as_assistant_message() is our serialization helper.)
        conversation.append(result.as_assistant_message())
        # Execute all tool calls and feed results back
        for call in result.tool_calls:
            tool_result = await self._registry.execute(
                call.function_name,
                call.arguments,
            )
            conversation.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": tool_result.output,
            })
    # Turn budget exhausted: fall back to the last tool output
    return conversation[-1]["content"]
```
### What the dev agent can do
We expose 7 tools through the registry:
| Tool | Description |
|------|-------------|
| `read_file` | Read any file by absolute path |
| `write_file` | Create or overwrite a file |
| `edit_file` | Replace specific string in a file |
| `list_dir` | List directory contents |
| `glob_files` | Find files by pattern |
| `grep_files` | Search file content by regex |
| `bash` | Execute shell commands (30s timeout) |
A developer sends a request to `POST /v0/dev/chat`:
```bash
curl -X POST http://localhost:8080/v0/dev/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{
      "role": "user",
      "content": "List the Python files in /app and show me the contents of pyproject.toml"
    }]
  }'
```
The Qwen3.5-122B model running on Jetson:
1. Decides to call `list_dir("/app")` — sees `brainiak/`, `core/`, `pyproject.toml`
2. Decides to call `read_file("/app/pyproject.toml")` — reads the file
3. Synthesizes a coherent response with both results
All of this happens on-device. The model reasons, selects tools, reads files, and responds — autonomously, with no cloud dependency.
```json
{
"response": "The /app directory contains: brainiak/, core/, pyproject.toml...\n\nContents of pyproject.toml:\n[project]\nname = \"brainiak\"...",
"tools_used": 2
}
```
### Architecture isolation
One design decision worth highlighting: the dev agent uses a **separate AgentAlpha instance** from the main pipeline singleton. The pipeline's `intake/thinking/recompose` modes don't see the dev tools. The tool registry uses scope filtering — `scope="dev"` returns dev tools, `scope="sandbox"` returns sandboxed Python execution tools. Contamination between pipeline modes is architecturally impossible.
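The scope filtering can be sketched in a few lines (class shape and field names illustrative — the real registry also carries JSON schemas and execution handlers per tool):

```python
from dataclasses import dataclass, field

@dataclass
class ToolRegistry:
    # name -> (allowed scopes, tool definition passed to the LLM)
    _tools: dict = field(default_factory=dict)

    def register(self, name: str, scopes: set, definition: dict) -> None:
        self._tools[name] = (scopes, definition)

    def get_definitions(self, scope: str) -> list:
        # A mode only ever sees tools tagged with its own scope.
        return [d for (scopes, d) in self._tools.values() if scope in scopes]
```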
---
## §5 — Free Speedup: MTP Speculative Decoding on Qwen3.5
After validating the dev agent, we benchmarked it and found latency was the main bottleneck: 14-20 seconds per tool call turn on the 122B. Good enough for batch pipelines, but too slow for interactive use.
The solution came from an unexpected place: the model itself.
### The model has built-in draft heads
Inspecting the model's `model.safetensors.index.json` (weight names) together with its `config.json` reveals something most people miss:
```
"mtp.layers.0.self_attn.q_proj":...,
"mtp.layers.0.mlp.shared_expert.gate_proj":...,
"mtp_num_hidden_layers": 1
```
Qwen3.5-122B-A10B ships with **Multi-Token Prediction (MTP) layers** — additional transformer blocks trained to predict multiple future tokens in parallel. This is speculative decoding built into the model, requiring no separate draft model, no extra download, no additional memory.
### What MTP layers actually are
An MTP layer is a full transformer block (self-attention + MLP) that runs on the main model's hidden states and proposes N candidate tokens ahead. The main model validates them in one forward pass — accepting correct predictions and correcting wrong ones.
Think of it as a specialized sub-agent embedded in the model:
- **Main model** (48 layers, 10B active params) = the "verifier"
- **MTP layer** (1 transformer block) = the "drafter"
- The drafter proposes, the verifier accepts or corrects
This is structurally identical to how BrainiaK's nodal architecture works — fast specialized nodes propose, a higher-level node validates. The boundary between "one model" and "multiple agents" is architectural, not fundamental.
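The draft-and-verify cycle can be sketched with toy token functions (greedy acceptance only; vLLM's actual rejection sampling is more subtle, and here we query the verifier per position where the real engine scores all draft positions in one forward pass):

```python
def speculative_step(drafter, verifier, prefix, k=3):
    """One draft/verify cycle: propose k tokens, keep the agreeing prefix."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = drafter(ctx)              # cheap MTP head proposes next token
        draft.append(t)
        ctx.append(t)
    verified = [verifier(prefix + draft[:i]) for i in range(k)]
    out = []
    for d, v in zip(draft, verified):
        if d == v:
            out.append(d)             # accepted draft token
        else:
            out.append(v)             # first mismatch: keep verifier's token
            break
    return out
```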
### Enabling MTP in vLLM
vLLM 0.16.0rc2 (the NVIDIA AI IOT Jetson nightly build) supports MTP via the `--speculative-config` flag. The method `qwen3_5_mtp` is specifically implemented for this model family:
```yaml
command: >
  vllm serve /data/models/huggingface/agent-alpha-active
  --max-model-len 131072
  --gpu-memory-utilization 0.90
  --served-model-name qwen-a qwen-b qwen-copilot qwen-thinking
  --trust-remote-code
  --enable-auto-tool-choice
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"qwen3_5_mtp", "num_speculative_tokens":3}'
```
One line added for MTP. No model download. No memory overhead. (The `0.90` and `131072` values come from §6 — context expansion discovered after the MTP work.)
You can discover available speculative methods in your vLLM build with:
```python
from vllm.config import SpeculativeConfig
import dataclasses
for f in dataclasses.fields(SpeculativeConfig):
if f.name == 'method':
print(f.type)
# Lists: ngram, eagle, medusa, qwen3_5_mtp, qwen3_next_mtp, deepseek_mtp,...
```
### Benchmark results: baseline vs MTP×1 vs MTP×3
We measured wall-clock latency on five dev agent scenarios (system prompt + tool execution + synthesis):
| Scenario | Baseline | MTP×1 | MTP×3 | Total gain |
|----------|----------|-------|-------|------------|
| list_dir (1 tool call) | 14.6s | 12.4s | **10.7s** | **-27%** |
| list + read_file (2 calls) | 40.6s | 30.0s | **21.6s** | **-47%** |
| bash command (1 call) | 9.0s | 7.4s | **6.1s** | **-32%** |
| grep + read_file (2 calls) | 27.1s | 21.8s | **17.3s** | **-36%** |
| Direct answer (no tools) | 18.6s | 14.1s | **12.6s** | **-32%** |
**MTP×3 delivers -30% to -47% latency reduction with zero cost.** The two-tool scenarios benefit most because each tool call cycle gains from speculative prediction of the tool call format (predictable, repetitive tokens that the MTP head learns well).
### Why `num_speculative_tokens=3` is the sweet spot
The model has `mtp_num_hidden_layers: 1` — one MTP layer. With 3 speculative tokens, we're asking the MTP head to propose 3 tokens per draft cycle. On tool call patterns like `<tool_call><function=list_dir>`, token sequences are highly predictable and acceptance rates are high.
Going beyond 3 hits diminishing returns: longer drafts increase the probability of a rejection, which wastes the draft computation. We found 3 to be optimal without further tuning.
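The diminishing return is easy to see with a back-of-envelope model: with a per-token acceptance probability p (an i.i.d. simplification, not a measured figure), the expected tokens emitted per verify pass is (1 − p^(k+1)) / (1 − p):

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens per verify pass: accepted draft prefix + 1 verifier token."""
    return (1 - p ** (k + 1)) / (1 - p)

# At an illustrative p = 0.8, most of the gain is already captured by k = 3:
for k in (1, 3, 5):
    print(k, round(expected_tokens(0.8, k), 2))   # 1.8, 2.95, 3.69
```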
### Startup confirmation
On startup, vLLM logs confirm MTP activation:
```
Resolved architecture: Qwen3_5MoeMTP ← not just Qwen3_5Moe
Initializing a V1 LLM engine with config: speculative_config=SpeculativeConfig(method='mtp', num_spec_tokens=3)
```
The V1 engine is required for MTP and is auto-selected when speculative config is provided.
---
## §6 — Breaking the 8K Wall: 131 072 Tokens on a Single Jetson
After MTP, the next experiment was context length. Our initial config used `--max-model-len 8192`. That's vLLM's conservative default — not a hardware limit. On a Jetson Thor T5000 with 128 GB, the actual ceiling is dramatically higher.
### Why 8K felt like a real constraint
At 8 192 tokens, a dev agent session hits the limit fast. A system prompt, a few conversation turns, and two file reads later — you're at 6 000 tokens. Reading a third file truncates the beginning of the conversation. The model loses context on what it already analyzed. Cross-file reasoning becomes impossible.
For a general-purpose dev assistant, 8K isn't a context window — it's a constraint that shapes every design decision around it.
### The KV cache math
The Qwen3.5-122B-A10B model uses Grouped Query Attention (GQA) with only **4 KV heads** — versus 64 query heads. This is the key insight: MoE models are extremely KV-efficient.
KV cost per token: `94 layers × 4 KV heads × 128 dim × 2 (K+V) × 2 bytes (bf16) = 188 KiB/token`
At startup, vLLM reports its allocated KV cache pool. At `gpu_memory_utilization=0.85`:
```
Available KV cache memory: 21.67 GiB
```
21.67 GiB ÷ 188 KiB = **~120 000 tokens of KV capacity**. With `max_model_len=8192`, we were using 6 GB of a 21 GB pool.
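The arithmetic above can be checked in a few lines:

```python
# KV-cache budget check (bf16 = 2 bytes; figures from the model config above).
layers, kv_heads, head_dim = 94, 4, 128
kv_per_token = layers * kv_heads * head_dim * 2 * 2   # K and V, 2 bytes each
assert kv_per_token == 188 * 1024                     # 188 KiB per token

pool_bytes = 21.67 * 1024**3                          # reported KV pool @ 0.85
print(int(pool_bytes // kv_per_token))                # ~120 000 tokens
```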
### Testing the ceiling incrementally
We tested in three steps the same evening, without changing the model or the hardware:
| Step | `max_model_len` | `gpu_memory_utilization` | KV pool | Result |
|------|-----------------|--------------------------|---------|--------|
| 1 | 32 768 | 0.85 | 21.67 GiB | ✓ healthy |
| 2 | 65 536 | 0.85 | 21.41 GiB | ✓ healthy |
| 3 | **131 072** | **0.90** | **27.22 GiB** | **✓ healthy** |
The KV pool barely changes between 32K and 65K — because `max_model_len` doesn't allocate KV memory, it just caps the maximum sequence length. The pool size is determined by `gpu_memory_utilization`. Bumping from 0.85 to 0.90 freed an extra 5.8 GiB, enough to cover the 131K requirement (131 072 × 188 KiB ≈ 23.5 GiB).
A crucial observation: **CUDA graphs are independent of context length**. The same 49 graphs with batch sizes 1–512 are captured regardless of whether `max_model_len` is 8K or 131K. Startup time stays the same.
### vLLM startup confirmation
```
INFO [model.py] Using max model len 131072
INFO [gpu_worker.py] Available KV cache memory: 27.22 GiB
INFO Application startup complete.
```
No OOM. No warnings about insufficient KV blocks. The model accepted the full 131K window at 0.90 utilization.
### What 131K enables: a live test
To validate the real-world impact, we sent a single query to the dev agent asking it to simultaneously read five core source files (agent_alpha.py, pipeline.py, nodes.py, runner.py, batch.py — 65 KB total) and produce a cross-file analysis:
> *"Trace the complete execution path of a user request from pipeline.py through all AgentAlpha modes to final response. Where exactly does ExecutionClass influence routing? Is there a latency bottleneck between thinking and recompose? Cite line numbers."*
**Results:**
- Tool calls: 6 (5 file reads + 1 synthesis)
- Response: 21 934 characters
- Response excerpt:
```
## (2) Where ExecutionClass Influences Routing
### Location A: node_arbiter (nodes.py lines 124-134)
exec_final = arbitrate(state["ir"], state["request"])
bellman = bellman_route(exec_final, recs)
if bellman.exec_class_adjusted != exec_final:
exec_final = bellman.exec_class_adjusted # ← modified here
### Location B: _route_after_retrieve() (graph.py lines 62, 71-72, 78-79)
is_heavy = state.get("exec_class_final") == ExecutionClass.heavy
if is_heavy:
return "agentic" # lines 71-72
### Location C: agent.intake() (agent_alpha.py lines 295-298)
exec_class_suggested = ExecutionClass(parsed.get("execution_class",...))
## (3) Latency Bottleneck Between Thinking and Recompose
Yes — sequential execution, no parallelization possible.
node_thinking starts: nodes.py line 310
Context injection into recompose: agent_alpha.py lines 652-653
Total latency is additive: thinking_latency + recompose_latency
```
The model correctly identified three distinct ExecutionClass routing points with precise line numbers, spanning four different files — reasoning that requires holding the entire codebase cross-section in context simultaneously.
**At 8K tokens, this query was impossible.** At 131K, it runs in a single call.
### The qualitative shift for a dev assistant
| Context | What's possible |
|---------|-----------------|
| 8 192 | 1-2 files, short conversation, simple lookups |
| 32 768 | ~8 files, standard dev session |
| 131 072 | Entire small-to-medium codebase simultaneously, architectural analysis, cross-module refactoring with full awareness |
A codebase of 150 Python files (~200K tokens uncompressed) can be loaded almost entirely into a 131K context after stripping comments and docstrings. For real engineering-office workloads — design documents, specs, implementation files, test suites all loaded simultaneously — 131K is the threshold where the assistant stops asking "which file do you want me to read?" and starts actually understanding the system.
---
## §7 — MathCore: Closed-Loop Control Over Your LLM Pipeline
Deploying a 122B model on edge hardware is only half the problem. The other half: **how do you know when it starts degrading?**
LLM pipelines fail silently. Latency drifts up by 20%, then 40%. One node becomes a bottleneck. The model starts timing out on complex requests. Without a monitoring and control layer, you only notice when users complain.
BrainiaK includes MathCore — a closed-loop control system that observes the pipeline, detects drift, and adjusts routing decisions autonomously.
### Real telemetry from production
Every pipeline execution emits `GRAPH_NODE_DONE` telemetry events. After running BrainiaK in production, the per-node latency profile looks like this:
```sql
SELECT node_id, COUNT(*) as events,
ROUND(AVG(latency_ms)) as avg_latency_ms,
ROUND(MIN(latency_ms)) as min_ms,
ROUND(MAX(latency_ms)) as max_ms
FROM telemetry_event
WHERE event_type = 'GRAPH_NODE_DONE'
GROUP BY node_id ORDER BY node_id;
```
```
node_id | events | avg_latency_ms | min_ms | max_ms
--------------+--------+----------------+--------+--------
arbiter | 13 | 1 | 0 | 5
intake | 19 | 11980 | 17 | 84492
recompose | 11 | 42158 | 46 | 120018
retrieve | 13 | 5 | 1 | 8
thinking | 1 | 54 | 54 | 54
write_memory | 10 | 3 | 1 | 5
```
This table tells the whole story immediately:
- **arbiter, retrieve, write_memory**: 1-5ms — these nodes are never the bottleneck
- **intake**: 11 980 ms average — LLM classification, cold/warm variance is huge (17 ms → 84 s)
- **recompose**: 42 158 ms average — LLM synthesis, the dominant cost of the pipeline
- **thinking**: 54ms, 1 sample — almost never triggered (heavy requests only)
After enabling MTP×3 speculative decoding (§5), intake dropped to ~8s average and recompose to ~25-28s. MathCore measures this improvement automatically.
### The control loop architecture
```
Every request
│
▼
GRAPH_NODE_DONE event (node_id, latency_ms, error=bool, tenant_id)
│
▼ nightly (00:05)
DailyAggregator
→ daily_aggregate(node_id, date, avg_latency_ms, error_rate, sample_count)
│
▼ nightly (after aggregation)
MathCore Pipeline
├── FFS (Functional Feature Synthesis)
│ FPCA via SVD on time-series of daily_aggregates
│ → functional representation of each node's behavior curve
│
├── MixMod (Mixture Models)
│ GMM-EM on FPCA scores
│ → cluster assignment (normal / degraded / outlier)
│
├── Drift Detector
│ Mahalanobis distance from baseline cluster
│ → drift_score per node (0.0 = nominal, >1.0 = drifting)
│
└── Recommender
weight_adjustment = 0.6 × live_penalty + 0.4 × drift_score
→ NodeWeightAdjustments per node
│
▼
Bellman DP Router
backward induction on DAG with adjusted costs
→ optimal path changes if c(thinking) > 2 × c(recompose)
→ heavy requests downgraded to normal automatically
```
### What happens when a node drifts
The Bellman DP router computes the optimal path through the DAG at each request using current node costs. If the `thinking` node's drift score increases (e.g., after a model update or hardware thermal event), the router automatically downgrades `heavy` execution class to `normal` — skipping the thinking node entirely.
This happens without any human intervention. The system measures → detects → adapts.
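The drift score and weight rule from the diagram reduce to a few lines (1-D Mahalanobis shown for simplicity; the real detector works in FPCA-score space, and the constants come straight from the diagram above):

```python
def drift_score(x: float, mu: float, sigma: float) -> float:
    """1-D Mahalanobis distance from the baseline cluster."""
    return abs(x - mu) / sigma

def weight_adjustment(live_penalty: float, drift: float) -> float:
    """Recommender rule: 0.6 × live penalty + 0.4 × drift score."""
    return 0.6 * live_penalty + 0.4 * drift

def should_downgrade(cost_thinking: float, cost_recompose: float) -> bool:
    """Bellman routing condition: skip thinking when its cost dominates."""
    return cost_thinking > 2 * cost_recompose
```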
### Nightly batch orchestrator
A single cron entry drives the entire closed-loop cycle:
```bash
# /etc/cron.d/brainiak — runs at 00:05 every night
5 0 * * * jeanphi bash ~/brainiak/scripts/nightly.sh
```
```bash
# nightly.sh
curl -sf -X POST http://localhost:8080/v0/admin/batch/nightly \
  -H 'Content-Type: application/json' \
  -d '{"tenant_id": "..."}'
```
The batch endpoint runs the full DAG: daily aggregation → weekly aggregation → MathCore pipeline → warmup of the NodeMetricRegistry. The result:
```json
{
"batch_id": "nightly-2026-03-04",
"daily": {"events_processed": 29, "node_count": 6, "tenant_count": 1},
"weekly": {"status": "ok"},
"pipeline": {"status": "ok", "nodes_analyzed": 6},
"warmup": {"samples": 61, "nodes": 6}
}
```
The warmup step pre-loads 7 days of daily aggregates into the NodeMetricRegistry at startup — so Bellman DP has baseline costs available from the first request after a restart.
### Why this matters at the edge
Cloud LLM providers have entire observability teams, A/B testing infrastructure, and canary deployments. On-premise edge deployments have none of that. MathCore brings the same closed-loop intelligence to a single board.
The combination with MTP speculative decoding closes the loop further: MTP reduces latency → daily aggregates show improved node metrics → Bellman DP can be less conservative about routing heavy requests → quality improves. The system self-optimizes.
> **Note on data requirements**: FPCA (Functional Principal Component Analysis) needs a time-series of daily aggregates — typically 14-30 days for meaningful decomposition. On a fresh deployment, MathCore reports `insufficient_data` and the Bellman DP falls back to static costs. Full drift detection activates automatically as data accumulates.
---
## §8 — The Full Architecture: Two Jetsons, One Brain
BrainiaK is not a single Jetson. It's two Jetson Thor T5000 boards connected by a direct 5 Gbps Ethernet link, forming a single nervous system where compute is concentrated but distributed by role.
**Jetson 1 — The Brain (10.0.0.1)**
One physical model: Qwen3.5-122B-A10B-AWQ-4bit (~102 GB GPU). Four logical roles temporally multiplexed via `--served-model-name`, with MTP×3 speculative decoding — the exact same configuration shown in §2:
```yaml
command: >
  vllm serve /data/models/huggingface/agent-alpha-active
  --max-model-len 131072
  --gpu-memory-utilization 0.90
  --served-model-name qwen-a qwen-b qwen-copilot qwen-thinking
  --trust-remote-code
  --enable-auto-tool-choice
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"qwen3_5_mtp", "num_speculative_tokens":3}'
```
Exposed port: `8001`. The four logical aliases map to the DAG pipeline phases:
- `qwen-a`: Intake (request classification, Prompt Machine selection)
- `qwen-copilot`: Copilot (tool-calling loop, code execution)
- `qwen-thinking`: Thinking (deep reasoning, verification)
- `qwen-b`: Recompose (final synthesis, output formatting)
**Jetson 2 — The Polymorphic Hive (10.0.0.2)**
A smaller base model (Qwen3.5-9B, ~19 GB) serves five logical experts (the base model itself plus four pre-loaded LoRA adapters):
```bash
vllm serve /data/models/huggingface/qwen35-9b-active \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8765 \
  --served-model-name base \
  --trust-remote-code \
  --enforce-eager \
  --enable-lora --max-lora-rank 32 \
  --lora-modules \
    oracle_v5=/adapters/oracle_v5 \
    critic=/adapters/critic \
    code=/adapters/code \
    math=/adapters/math
```
The LoRA adapters are pre-loaded at startup; there is **no dynamic endpoint** for loading/unloading. Routing to a specific expert (or the bare base model) is done via the `model` field in the JSON payload:
```json
{"model": "critic", "messages": [...]} // Post-response verification
{"model": "oracle_v5", "messages": [...]} // MathCore regime classification
{"model": "code", "messages": [...]} // Code generation
{"model": "math", "messages": [...]} // Mathematical reasoning
{"model": "base", "messages": [...]} // Chat, summarization, vision
```
LoRA switch time: <50ms via vLLM's native adapter mechanism.
Qwen3.5-9B is natively multimodal (early fusion — no separate ViT encoder). The base model on Jetson 2 handles both text and vision tasks without requiring an additional vision-specific model.
**Direct 5 Gbps Ethernet Link**
Jetson 1 (10.0.0.1) ↔ Jetson 2 (10.0.0.2) via direct Ethernet cable. Latency <1ms. No NAT, no intermediate firewall. The Core Runtime on Jetson 1 calls the Hive on Jetson 2 via `http://10.0.0.2:8765/v1/chat/completions`.
**Composite Memory: L0 → L1 → L2 → L3**
| Layer | Storage | Function |
|-------|---------|----------|
| L0 | Hard sliding window (last 20 messages) | Active LLM context — everything older is evicted |
| L1 | Postgres 16 (conversation segments + keyword index) | LLM-summarized segments, full-text search |
| L2 | Qdrant (vector search, BGE-M3 embeddings planned) | Semantic retrieval (integration in progress) |
| L3 | brainiak-memory git repo (SSH deploy key) | Long-term crystallization, full-text archive |
The eviction pipeline runs asynchronously: when L0 overflows, evicted messages are summarized by Qwen3.5-9B on Jetson 2, stored as keyword-indexed segments in Postgres L1, and archived to git L3. The hippocampus retrieves relevant past context via L1 keyword search — no need to keep raw history in the active window.
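The L0 window and its eviction hook can be sketched as follows (class name and callback are illustrative; in production the callback kicks off the async summarization on Jetson 2):

```python
from collections import deque

class L0Window:
    """Hard sliding window: keeps the last `limit` messages and hands the
    overflow to an eviction callback (summarization into L1 in production)."""

    def __init__(self, limit=20, on_evict=None):
        self.limit = limit
        self.on_evict = on_evict
        self.messages = deque()

    def append(self, msg):
        self.messages.append(msg)
        while len(self.messages) > self.limit:
            evicted = self.messages.popleft()
            if self.on_evict:
                self.on_evict(evicted)   # async in the real system
```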
**Symmetric Tool Hub**
One Tool Hub per Jetson, with cross-node HTTP forwarding:
- Jetson 1: Tool Hub (filesystem, bash, git, Docker, GPU stats, Postgres queries)
- Jetson 2: Tool Hub (Qdrant search, Hive vLLM health, sandboxed execution)
- Forwarding: each Tool Hub knows its peer URL and routes requests for remote tools via HTTP
**Hippocampal Grounding Gate**
Before every inference, the system prompt is automatically enriched with:
- Current system configuration (ports, active LoRA adapters, service status)
- MathCore signal (if available: regime, λ_max, α)
- Relevant L1 memory segments (retrieved via Postgres keyword search)
This gate prevents configuration hallucination — the model always "knows" what physical state it's running in.
---
## §9 — What's Next
**MathCore Phase 1 — Automatic Activation at J+7**
The MathCore pipeline is operational but in observation mode (CI gate: 7 days of data required). After the observation period, automatic activation:
- DailyAggregator → WeeklyAggregator → FFS/MFPCA → MixMod → Drift Detection → Recommender
- Bellman DP routing: dynamic node cost optimization based on collected metrics
- Synaptic signal injected into system prompt: `[MathCore | regime: <label> | λ_max: <value> | α: <value>]`
**Specialized LoRA: OCR for French Technical Documents**
Fine-tuning a specialized OCR LoRA on a corpus of 1 000 French energy performance certificates (DPE):
- Teacher labeling via Qwen3.5-122B (structured data extraction)
- Base model: Qwen3.5-9B on Jetson 2 (natively multimodal — early fusion handles images directly)
- Goal: OCR + semantic understanding of complex French technical documents (tables, floor plans, energy labels)
**BGE-M3 Embeddings for L2 Qdrant**
Replacing basic retrieval with BGE-M3 (multilingual, 1024 dimensions):
- Better performance on French and code
- Integration into the L1 → L2 retrieval pipeline
- Dedicated Qdrant collection: `brainiak_memory_v2`
**Monthly Crystallization (L4)**
SFT dataset generated from validated L0 → L1 → L2 → L3 interactions:
- Filtering: only responses where critic verdict = PASS
- Monthly fine-tune of the 122B with newly acquired competencies
- Goal: crystallize reasoning and tool-use patterns into the model weights
**Scaling — 8-16 Jetson Thor Cluster**
Extending the architecture to a cluster of 8-16 Jetson Thor T5000 boards:
- Physical network topology mirrors the Gamma externality matrix from our theoretical framework
- Each Jetson becomes a specialized Hive node (OCR, code, math, memory, etc.)
- Orchestration via NATS JetStream + DAG Engine
- The system already computes and exposes its own Gamma matrix (inter-node correlations) — as the cluster grows, these correlations become the substrate for emergent collective behavior
---
*Jean-Philippe Garnier — March 2026*
---