*How we deployed a full orchestrated knowledge pipeline with a 122B parameter model on a single edge device, no cloud required.*
---
## Why This Matters
When NVIDIA launched the Jetson Thor T5000 in mid-2025, the embedded AI world changed quietly but profoundly. For the first time, a single edge device ships with **128 GB of unified LPDDR5X memory**: enough to run a 122B parameter model in AWQ-4bit quantization entirely in local memory, without splitting across devices, without cloud inference, and without data leaving the premises.
Most published work on Thor T5000 focuses on robotics: humanoids, autonomous vehicles, sensor fusion. That makes sense; it's what NVIDIA markets. But we went in a different direction.
We built **BrainiaK**: a full agentic knowledge pipeline running on a single Thor T5000, serving as a personal cognitive assistant and development agent. Multi-mode routing, composite memory, tool execution, and a closed-loop self-optimization system called MathCore. All on-premise. All sovereign.
This article covers what it actually takes to get there.
---
## Hardware Setup
**The Jetson AGX Thor T5000 developer kit:**
| Spec | Value |
|---|---|
| GPU | NVIDIA Blackwell (B-class) |
| AI Performance | 2,070 TOPS (FP4) |
| Memory | 128 GB LPDDR5X unified (CPU + GPU) |
| Storage | 1 TB NVMe (PCIe Gen5) |
| Network | 10 GbE onboard |
| Power | 40-130 W TDP |
| OS | JetPack 7.1 (Ubuntu 22.04 ARM64) |
The unified memory architecture is the key insight. On a discrete GPU setup, you pay a PCIe transfer cost every time data moves between system RAM and VRAM. On Thor T5000, the CPU and GPU share the same physical memory pool. For LLM inference, this means the model weights live once, accessible by both compute units simultaneously.
**Model we're running:**
```
Qwen3.5-122B-A10B-AWQ (4-bit quantized)
Model size on disk : ~65 GB
GPU memory at load : ~72 GB (leaves ~55 GB for KV cache + system)
Throughput measured : ~13 tokens/second
```
We also tested Qwen3.5-235B-A22B-AWQ, which also fits within the 128 GB envelope at ~116 GB loaded. Throughput drops to ~6-7 tok/s, but quality is exceptional for deep reasoning tasks.
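The headroom arithmetic behind both configurations is simple enough to sketch; figures mirror the measurements above and are approximate:

```python
# Headroom left for KV cache + OS once the weights are resident
# in the 128 GB unified pool. All figures approximate.
TOTAL_GB = 128

def kv_headroom(loaded_gb: float) -> float:
    """GB remaining after the model weights are loaded."""
    return TOTAL_GB - loaded_gb

for name, loaded_gb in [("Qwen3.5-122B-AWQ", 72), ("Qwen3.5-235B-AWQ", 116)]:
    print(f"{name}: ~{kv_headroom(loaded_gb)} GB headroom")
```

The 235B model's ~12 GB of headroom is why its KV cache, and hence usable context under load, is much tighter than the 122B deployment's.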
---
## Serving with vLLM on JetPack 7.1
vLLM 0.16.0rc2 is the first version with solid Jetson Thor support. Getting it running on ARM64 JetPack 7.1 has specific gotchas.
**Installation:**
```bash
# JetPack 7.1 ships CUDA 13.0, Driver 580
# vLLM nightly for Jetson (ARM64 wheel)
pip install vllm==0.16.0rc2 \
--extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v71
# Verify GPU is visible
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# Expected: NVIDIA Jetson Thor
```
**Launch command:**
```bash
vllm serve Qwen/Qwen3.5-122B-A10B-Instruct-AWQ \
  --served-model-name qwen-a \
  --host 0.0.0.0 \
  --port 8001 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --trust-remote-code
```
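Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch; the payload follows the OpenAI chat-completions schema that vLLM serves, `qwen-a` matches `--served-model-name` above, and the code assumes the launch command is running locally:

```python
# Standard-library client for the local vLLM OpenAI-compatible
# endpoint. Assumes the `vllm serve` command above is running.
import json
import urllib.request

BASE = "http://localhost:8001/v1"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI chat-completions payload for the served model name."""
    return {
        "model": "qwen-a",  # --served-model-name from the launch command
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Blocking single-turn completion against the local server."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

At ~13 tok/s a blocking call like this is fine for smoke tests; long generations belong on the async pipeline described later.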
**Pitfalls we hit:**
1. **safe.directory git error**: occurs when the container runs as root but the repo is owned by another user. Fix: `git config --global --add safe.directory /workspace`
2. **USB-C confusion**: the T5000 dev kit has two USB-C ports. The one near the USB-A ports (labeled 5a) is data/recovery; the one near HDMI (5b) is power. Confuse them and the board won't boot.
3. **nvpmodel default mode**: JetPack boots in a balanced power mode. Switch to MAXN for maximum performance:
```bash
sudo nvpmodel -m 0 && sudo jetson_clocks
```
4. **Swap**: disable it. With 128 GB of unified memory you don't need swap, and it causes latency spikes:
```bash
sudo swapoff -a
```
---
## The Pipeline Architecture
A chatbot answers. BrainiaK routes, orchestrates, and adapts.
**Three execution modes:**
| Mode | When | Max Tokens | Latency |
|-----------|---------|-------------------|--------------|
| Fast | Simple queries, short content | 4,000 | ~5-15 s |
| Normal | Standard analysis, medium depth | 32,000 | ~15-60 s |
| Heavy (Thinking) | Complex reasoning, deep synthesis | 32,000 | ~2-10 min |
The Arbiter node decides the mode based on content length, explicit tags, and MathCore recommendations. It can upgrade or downgrade a suggested class:
```python
# Upgrade fast → normal if content too long
if suggested == ExecutionClass.fast and content_len > 500:
return ExecutionClass.normal
# Upgrade fast → normal if token budget too high for the fast lane
if suggested == ExecutionClass.fast and max_tokens > 4000:
return ExecutionClass.normal
```
The Thinking mode uses Qwen3.5's native chain-of-thought: the model reasons internally in `<think>...</think>` blocks before producing its final answer. We strip the thinking trace from the stored response but log the latency.
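The stripping step is a one-liner; a minimal sketch (the helper name is ours, the tag format is Qwen's native markup):

```python
# Strip the <think>...</think> trace before storage, keeping only
# the visible answer. Helper name is illustrative, not BrainiaK API.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop the chain-of-thought block, keep the final answer."""
    return THINK_RE.sub("", text)

raw = "<think>Let me reason step by step...</think>The answer is 42."
print(strip_thinking(raw))  # -> The answer is 42.
```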
---
## Tool Registry β The Agentic Layer
This is where BrainiaK diverges from "chatbot in a box." The Dev Agent node has access to 7 tools:
```python
TOOLS = [
"read_file", # Read any file with line-numbered output
"write_file", # Create or overwrite files
"edit_file", # Surgical string replacement in files
"list_dir", # Directory listing (ls with metadata)
"glob_files", # Pattern matching across the repo
"grep_files", # Ripgrep-powered content search
"bash", # Shell execution (timeout 30s, output 50KB)
]
```
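Conceptually, the registry is a name-to-callable map. A hedged sketch of two handlers and the dispatch, using the 30 s / 50 KB limits from the list above (the real handlers differ):

```python
# Name-to-callable tool registry sketch. Limits (30 s timeout,
# 50 KB output cap) are from the tool list; implementations are ours.
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    """File contents with line numbers, as the agent sees them."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(f"{i + 1}: {line}" for i, line in enumerate(lines))

def bash(cmd: str) -> str:
    """Shell execution with a hard timeout and truncated output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=30)
    return (proc.stdout + proc.stderr)[:50_000]

REGISTRY = {"read_file": read_file, "bash": bash}

def dispatch(name: str, **kwargs) -> str:
    """Route a model tool call to its handler by name."""
    return REGISTRY[name](**kwargs)
```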
The git workspace is mounted at `/workspace` inside the container. BrainiaK can read, edit, commit and push code autonomously:
```bash
# BrainiaK executing a git commit from inside Docker
bash("git -C /workspace add . && \
git -C /workspace commit -m '[fix] correct FPCA window' && \
git -C /workspace push")
```
This is exposed via `POST /v0/dev/chat`: a multi-turn conversation endpoint where the model maintains tool call history across turns. The model decides autonomously when to call tools and when to respond directly.
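The loop behind such an endpoint can be sketched as follows; `call_model` and `run_tool` are placeholder names standing in for the vLLM client and the Tool Registry:

```python
# Multi-turn tool loop: keep querying the model until it replies
# without tool calls. `call_model` / `run_tool` are stand-ins, and
# the message shape is a simplified version of the OpenAI format.
def agent_loop(messages, call_model, run_tool, max_turns: int = 10):
    for _ in range(max_turns):
        reply = call_model(messages)          # assistant message (dict)
        calls = reply.get("tool_calls")
        if not calls:                         # model chose to answer
            return reply["content"]
        messages.append(reply)                # keep tool-call history
        for call in calls:
            result = run_tool(call["name"], **call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    raise RuntimeError("max turns exceeded")
```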
**Real example of what BrainiaK does in a single /v0/dev/chat session:**
1. `grep_files("MAX_TOKENS")` → finds all occurrences across the repo
2. `read_file("brainiak/mathcore/config.py")` → reads the config
3. `edit_file(...)` → changes the value
4. `read_file("tests/test_recommender.py")` → checks the test
5. `bash("python -m pytest tests/ -q")` → runs the test suite
6. `bash("git -C /workspace commit -m '...'")` → commits
All from a single natural language instruction. No human in the loop.
---
## Composite Memory
One of the most underrated components. Three layers working together:
```
Request comes in:
  [BM25 lexical search]       <-- fast exact match
  [Qdrant vector search]      <-- semantic similarity
  [PostgreSQL hot memory]     <-- recent interactions
          |
  [Memory Gate]               -- deduplicates, ranks, injects into context
          |
  [LLM sees relevant past context]
```
The memory is tenant-isolated at the DB level. Every completed request stores:
- Query (up to 2,000 chars)
- Response summary (up to 6,000 chars)
- Execution class used
- Latency
- Tools called
On the next request, the pipeline retrieves the 7 most relevant past interactions and injects them as context. The model remembers what it did last week.
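The gate's merge step can be sketched as a dedupe-and-rank over the three result lists; the `(id, score, text)` layout is illustrative, not BrainiaK's actual schema:

```python
# Memory Gate sketch: merge hits from BM25, vector, and hot-memory
# layers, dedupe by id (keeping the best score), rank, keep top-k.
def memory_gate(bm25, vector, hot, k: int = 7):
    """Each input is a list of (hit_id, score, text) tuples."""
    best = {}
    for hit_id, score, text in [*bm25, *vector, *hot]:
        if hit_id not in best or score > best[hit_id][0]:
            best[hit_id] = (score, text)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(hid, score, text) for hid, (score, text) in ranked[:k]]
```

The default `k=7` mirrors the seven past interactions injected per request.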
---
## Real Performance Numbers
Measured on Jetson Thor T5000, JetPack 7.1, Qwen3.5-122B-AWQ:
```
Mode | Avg Latency | Tokens/s | Typical use
--------------+-------------+----------+------------------
Fast | 8-15s | ~13 | Quick questions
Normal | 20-60s | ~13 | Analysis, code
Thinking | 2-8 min | ~13 | Deep synthesis
Dev Agent | 1-3 min | ~13 | Multi-tool tasks
PDF Translate | ~24s/batch | ~13 | 10-block batches
```
The throughput is consistent across modes: ~13 tok/s is the hardware ceiling for 122B AWQ-4bit on this device. The latency difference between modes is driven entirely by output length and thinking-trace length.
**HTTP client timeout**: set to 3,600 seconds. At 13 tok/s with 32,000 max tokens, a full-budget response takes ~41 minutes. The pipeline is fully async: `POST /v0/request` returns 202 immediately, and the client polls for results.
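The client side of that pattern looks roughly like this; only "`POST /v0/request` returns 202 and the client polls" is from the pipeline itself, while `BASE`, the `request_id` field, the status names, and the `/v0/request/{id}` result path are our assumptions for illustration:

```python
# Submit-then-poll client sketch for the async pipeline.
# Endpoint details beyond POST /v0/request + 202 are assumed.
import json
import time
import urllib.request

BASE = "http://localhost:8000"  # BrainiaK Core address (assumed)

def submit(payload: dict) -> str:
    """POST the request; server answers 202 with an id (field assumed)."""
    req = urllib.request.Request(
        f"{BASE}/v0/request",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["request_id"]

def is_done(body: dict) -> bool:
    """Terminal statuses (names assumed)."""
    return body.get("status") in {"done", "failed"}

def wait_for(request_id: str, poll_s: float = 5.0,
             timeout_s: float = 3600.0) -> dict:
    """Poll until the request reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{BASE}/v0/request/{request_id}") as resp:
            body = json.load(resp)
        if is_done(body):
            return body
        time.sleep(poll_s)
    raise TimeoutError(request_id)
```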
---
## Docker Deployment
The entire stack runs in Docker Compose: PostgreSQL 16, NATS JetStream, and the BrainiaK Core service.
```yaml
# docker-compose.yml (excerpt)
services:
  brainiak-core:
    build: ..
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      BRAINIAK_VLLM_AGENT_ALPHA_URL: http://host.docker.internal:8001
    volumes:
      - ../brainiak.md:/app/brainiak.md:ro          # personality (live reload)
      - ../session_memory.md:/app/session_memory.md:ro
      - ..:/workspace                               # full repo for git ops
      - ~/.ssh:/root/.ssh:ro                        # git push from container
      - ~/.gitconfig:/root/.gitconfig:ro
```
vLLM runs on the host directly (not in Docker) to maintain direct GPU access. The BrainiaK Core container calls it via `host.docker.internal:8001`.
**Dockerfile essentials for ARM64:**
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
libpq5 git openssh-client \
&& rm -rf /var/lib/apt/lists/*
```
`git` and `openssh-client` are required for the Tool Registry to execute git operations from inside the container.
---
## MathCore β The Closed-Loop Layer (Preview)
This is what differentiates BrainiaK from any other edge deployment.
Every request emits a telemetry event: node ID, latency, token counts, outcome. After 7 days of production data, a nightly pipeline activates and analyzes these distributions. It identifies behavioral regimes, detects when a node starts drifting from its historical envelope, and feeds recommendations back into the Arbiter, adjusting token budgets and routing weights for the next day.
```
Production traffic
  --> Telemetry aggregation (nightly)
  --> Statistical analysis of latency distributions
  --> Behavioral regime detection
  --> Drift alerts
  --> Recommender
  --> back into routing decisions (next day)
```
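As a toy illustration of the drift-alert stage, here is a z-score check of today's latencies against the historical envelope; MathCore's actual statistics are considerably richer than this:

```python
# Toy drift check: how many historical standard deviations today's
# mean latency has moved. A stand-in for MathCore, not its method.
import statistics

def drift_score(history, today) -> float:
    """Z-score of today's mean latency vs. the historical envelope."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(statistics.mean(today) - mu) / sigma

def recommend(history, today, threshold: float = 3.0) -> str:
    """Flag a node whose latency has left its historical envelope."""
    return "investigate" if drift_score(history, today) > threshold else "hold"
```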
The pipeline reconfigures itself based on its own execution history. No human in the loop.
The mathematical framework behind MathCore is grounded in formal theory: multi-agent general equilibrium, functional data analysis, and dynamical systems applied to agentic architectures. The first preprint is available on arXiv:
**[arXiv:2602.21255](https://arxiv.org/abs/2602.21255)**.
MathCore is the implementation of that theory running live on this hardware. Full methodology papers are forthcoming (NeurIPS/ICML target).
As of publishing: **day 1/7** of the activation window. Follow-up article incoming.
---
## What's Next
- **Qwen3.5-235B-A22B-AWQ**: fits in 128 GB at ~6-7 tok/s. Already tested.
- **MathCore activation** (day 7): closed-loop token budget optimization
- **Jetson 2 (L'Atelier)**: OCR (Qwen3-VL-8B) + embeddings, 25 GbE direct link
- **Hive**: 4-16 Jetson Thor nodes, InfiniBand HDR, physical implementation of the distributed intelligence topology from our forthcoming papers
---
## Key Takeaways
1. **128 GB unified memory changes the edge AI equation**: 122B models are now a single-device problem
2. **vLLM on ARM64 works**, with specific pitfalls around CUDA drivers and memory configuration
3. **Agentic pipelines on edge** are not just possible; they're production-ready today
4. **Async pipeline + composite memory** is the architecture that scales beyond chatbots
5. **Sovereign, on-premise, no cloud**: not a constraint, a feature
The Jetson Thor T5000 is not a bigger Orin. It's a different category of device. The software ecosystem will take 12-18 months to catch up. Build now.
---
*BrainiaK is an open R&D project. Architecture papers forthcoming (NeurIPS/ICML target). Contact: jeanphi.garnier@brainiak.tech.*


