*How we deployed a full orchestrated knowledge pipeline with a 122B parameter model on a single edge device, no cloud required.*
---
## Why This Matters
When NVIDIA launched the Jetson Thor T5000 in mid-2025, the embedded AI world changed quietly but profoundly. For the first time, a single edge device ships with **128 GB of unified LPDDR5X memory**: enough to run a 122B parameter model in AWQ-4bit quantization entirely in local memory, without splitting across devices, without cloud inference, and without data leaving the premises.
Most published work on Thor T5000 focuses on robotics: humanoids, autonomous vehicles, sensor fusion. That makes sense; it's what NVIDIA markets. But we went in a different direction.
We built **BrainiaK**: a full agentic knowledge pipeline running on a single Thor T5000, serving as a personal cognitive assistant and development agent. Multi-mode routing, composite memory, tool execution, and a closed-loop self-optimization system called MathCore. All on-premise. All sovereign.
This article covers what it actually takes to get there.
---
## Hardware Setup
**The Jetson AGX Thor T5000 developer kit:**
| Spec | Value |
|---|---|
| GPU | NVIDIA Blackwell (B-class) |
| AI Performance | 2,070 TOPS (FP4) |
| Memory | 128 GB LPDDR5X unified (CPU + GPU) |
| Storage | 1 TB NVMe (PCIe Gen5) |
| Network | 10 GbE onboard |
| Power | 40-130 W TDP |
| OS | JetPack 7.1 (Ubuntu 22.04 ARM64) |
The unified memory architecture is the key insight. On a discrete GPU setup, you pay a PCIe transfer cost every time data moves between system RAM and VRAM. On Thor T5000, the CPU and GPU share the same physical memory pool. For LLM inference, this means the model weights live once, accessible by both compute units simultaneously.
**Model we're running:**
```
Qwen3.5-122B-A10B-AWQ (4-bit quantized)
Model size on disk : ~65 GB
GPU memory at load : ~72 GB (leaves ~55 GB for KV cache + system)
Throughput measured : ~13 tokens/second
```
We also tested Qwen3.5-235B-A22B-AWQ, which also fits within the 128 GB envelope at ~116 GB loaded. Throughput drops to ~6-7 tok/s, but quality is exceptional for deep reasoning tasks.
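The headroom arithmetic behind both configurations is simple enough to sketch; figures mirror the measurements above and are approximate:

```python
# Headroom left for KV cache + OS once the weights are resident
# in the 128 GB unified pool. All figures approximate.
TOTAL_GB = 128

def kv_headroom(loaded_gb: float) -> float:
    """GB remaining after the model weights are loaded."""
    return TOTAL_GB - loaded_gb

for name, loaded_gb in [("Qwen3.5-122B-AWQ", 72), ("Qwen3.5-235B-AWQ", 116)]:
    print(f"{name}: ~{kv_headroom(loaded_gb)} GB headroom")
```

The 235B model's ~12 GB of headroom is why its KV cache, and hence usable context under load, is much tighter than the 122B deployment's.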
---
## Serving with vLLM on JetPack 7.1
vLLM 0.16.0rc2 is the first version with solid Jetson Thor support. Getting it running on ARM64 JetPack 7.1 has specific gotchas.
**Installation:**
```bash
# JetPack 7.1 ships CUDA 13.0, Driver 580
# vLLM nightly for Jetson (ARM64 wheel)
pip install vllm==0.16.0rc2 \
--extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v71
# Verify GPU is visible
python3 -c "import torch; print(torch.cuda.get_device_name(0))"
# Expected: NVIDIA Jetson Thor
```
**Launch command:**
```bash
vllm serve Qwen/Qwen3.5-122B-A10B-Instruct-AWQ \
  --served-model-name qwen-a \
  --host 0.0.0.0 \
  --port 8001 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --trust-remote-code
```
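Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch; the payload follows the OpenAI chat-completions schema that vLLM serves, `qwen-a` matches `--served-model-name` above, and the code assumes the launch command is running locally:

```python
# Standard-library client for the local vLLM OpenAI-compatible
# endpoint. Assumes the `vllm serve` command above is running.
import json
import urllib.request

BASE = "http://localhost:8001/v1"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI chat-completions payload for the served model name."""
    return {
        "model": "qwen-a",  # --served-model-name from the launch command
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Blocking single-turn completion against the local server."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

At ~13 tok/s a blocking call like this is fine for smoke tests; long generations belong on the async pipeline described later.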
**Pitfalls we hit:**
1. **safe.directory git error**: occurs when the container runs as root but the repo is owned by another user. Fix: `git config --global --add safe.directory /workspace`
2. **USB-C confusion**: the T5000 dev kit has two USB-C ports. The one near the USB-A ports (labeled 5a) is data/recovery; the one near HDMI (5b) is power. Confuse them and the board won't boot.
3. **nvpmodel default mode**: JetPack boots in a balanced power mode. Switch to MAXN for maximum performance:
```bash
sudo nvpmodel -m 0 && sudo jetson_clocks
```
4. **Swap**: disable it. With 128 GB of unified memory you don't need swap, and it causes latency spikes:
```bash
sudo swapoff -a
```
---
## The Pipeline Architecture
A chatbot answers. BrainiaK routes, orchestrates, and adapts.
**Three execution modes:**
| Mode | When | Max Tokens | Latency |
|-----------|---------|-------------------|--------------|
| Fast | Simple queries, short content | 4,000 | ~5-15 s |
| Normal | Standard analysis, medium depth | 32,000 | ~15-60 s |
| Heavy (Thinking) | Complex reasoning, deep synthesis | 32,000 | ~2-10 min |
The Arbiter node decides the mode based on content length, explicit tags, and MathCore recommendations. It can upgrade or downgrade a suggested class:
```python
# Upgrade fast → normal if content too long
if suggested == ExecutionClass.fast and content_len > 500:
return ExecutionClass.normal
# Upgrade fast → normal if token budget too high for the fast lane
if suggested == ExecutionClass.fast and max_tokens > 4000:
return ExecutionClass.normal
```
The Thinking mode uses Qwen3.5's native chain-of-thought: the model reasons internally in `<think>...</think>` blocks before producing its final answer. We strip the thinking trace from the stored response but log the latency.
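The stripping step is a one-liner; a minimal sketch (the helper name is ours, the tag format is Qwen's native markup):

```python
# Strip the <think>...</think> trace before storage, keeping only
# the visible answer. Helper name is illustrative, not BrainiaK API.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop the chain-of-thought block, keep the final answer."""
    return THINK_RE.sub("", text)

raw = "<think>Let me reason step by step...</think>The answer is 42."
print(strip_thinking(raw))  # -> The answer is 42.
```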
---
## Tool Registry β The Agentic Layer
This is where BrainiaK diverges from "chatbot in a box." The Dev Agent node has access to 7 tools:
```python
TOOLS = [
"read_file", # Read any file with line-numbered output
"write_file", # Create or overwrite files
"edit_file", # Surgical string replacement in files
"list_dir", # Directory listing (ls with metadata)
"glob_files", # Pattern matching across the repo
"grep_files", # Ripgrep-powered content search
"bash", # Shell execution (timeout 30s, output 50KB)
]
```
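Conceptually, the registry is a name-to-callable map. A hedged sketch of two handlers and the dispatch, using the 30 s / 50 KB limits from the list above (the real handlers differ):

```python
# Name-to-callable tool registry sketch. Limits (30 s timeout,
# 50 KB output cap) are from the tool list; implementations are ours.
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    """File contents with line numbers, as the agent sees them."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(f"{i + 1}: {line}" for i, line in enumerate(lines))

def bash(cmd: str) -> str:
    """Shell execution with a hard timeout and truncated output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=30)
    return (proc.stdout + proc.stderr)[:50_000]

REGISTRY = {"read_file": read_file, "bash": bash}

def dispatch(name: str, **kwargs) -> str:
    """Route a model tool call to its handler by name."""
    return REGISTRY[name](**kwargs)
```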
The git workspace is mounted at `/workspace` inside the container. BrainiaK can read, edit, commit and push code autonomously:
```bash
# BrainiaK executing a git commit from inside Docker
bash("git -C /workspace add . && \
git -C /workspace commit -m '[fix] correct FPCA window' && \
git -C /workspace push")
```
This is exposed via `POST /v0/dev/chat`: a multi-turn conversation endpoint where the model maintains tool call history across turns. The model decides autonomously when to call tools and when to respond directly.
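The loop behind such an endpoint can be sketched as follows; `call_model` and `run_tool` are placeholder names standing in for the vLLM client and the Tool Registry:

```python
# Multi-turn tool loop: keep querying the model until it replies
# without tool calls. `call_model` / `run_tool` are stand-ins, and
# the message shape is a simplified version of the OpenAI format.
def agent_loop(messages, call_model, run_tool, max_turns: int = 10):
    for _ in range(max_turns):
        reply = call_model(messages)          # assistant message (dict)
        calls = reply.get("tool_calls")
        if not calls:                         # model chose to answer
            return reply["content"]
        messages.append(reply)                # keep tool-call history
        for call in calls:
            result = run_tool(call["name"], **call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    raise RuntimeError("max turns exceeded")
```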
**Real example of what BrainiaK does in a single /v0/dev/chat session:**
1. `grep_files("MAX_TOKENS")` → finds all occurrences across the repo
2. `read_file("brainiak/mathcore/config.py")` → reads the config
3. `edit_file(...)` → changes the value
4. `read_file("tests/test_recommender.py")` → checks the test
5. `bash("python -m pytest tests/ -q")` → runs the test suite
6. `bash("git -C /workspace commit -m '...'")` → commits
All from a single natural language instruction. No human in the loop.
---
## Composite Memory
One of the most underrated components. Three layers working together:
```
Request comes in:
  [BM25 lexical search]       <-- fast exact match
  [Qdrant vector search]      <-- semantic similarity
  [PostgreSQL hot memory]     <-- recent interactions
          |
  [Memory Gate]               -- deduplicates, ranks, injects into context
          |
  [LLM sees relevant past context]
```
The memory is tenant-isolated at the DB level. Every completed request stores:
- Query (up to 2,000 chars)
- Response summary (up to 6,000 chars)
- Execution class used
- Latency
- Tools called
On the next request, the pipeline retrieves the 7 most relevant past interactions and injects them as context. The model remembers what it did last week.
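The gate's merge step can be sketched as a dedupe-and-rank over the three result lists; the `(id, score, text)` layout is illustrative, not BrainiaK's actual schema:

```python
# Memory Gate sketch: merge hits from BM25, vector, and hot-memory
# layers, dedupe by id (keeping the best score), rank, keep top-k.
def memory_gate(bm25, vector, hot, k: int = 7):
    """Each input is a list of (hit_id, score, text) tuples."""
    best = {}
    for hit_id, score, text in [*bm25, *vector, *hot]:
        if hit_id not in best or score > best[hit_id][0]:
            best[hit_id] = (score, text)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(hid, score, text) for hid, (score, text) in ranked[:k]]
```

The default `k=7` mirrors the seven past interactions injected per request.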
---
## Real Performance Numbers
Measured on Jetson Thor T5000, JetPack 7.1, Qwen3.5-122B-AWQ:
```
Mode | Avg Latency | Tokens/s | Typical use
--------------+-------------+----------+------------------
Fast | 8-15s | ~13 | Quick questions
Normal | 20-60s | ~13 | Analysis, code
Thinking | 2-8 min | ~13 | Deep synthesis
Dev Agent | 1-3 min | ~13 | Multi-tool tasks
PDF Translate | ~24s/batch | ~13 | 10-block batches
```
The throughput is consistent across modes: ~13 tok/s is the hardware ceiling for 122B AWQ-4bit on this device. The latency difference between modes is driven entirely by output length and thinking-trace length.
**HTTP client timeout**: set to 3,600 seconds. At 13 tok/s with 32,000 max tokens, a full-budget response takes ~41 minutes. The pipeline is fully async: `POST /v0/request` returns 202 immediately, and the client polls for results.
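The client side of that pattern looks roughly like this; only "`POST /v0/request` returns 202 and the client polls" is from the pipeline itself, while `BASE`, the `request_id` field, the status names, and the `/v0/request/{id}` result path are our assumptions for illustration:

```python
# Submit-then-poll client sketch for the async pipeline.
# Endpoint details beyond POST /v0/request + 202 are assumed.
import json
import time
import urllib.request

BASE = "http://localhost:8000"  # BrainiaK Core address (assumed)

def submit(payload: dict) -> str:
    """POST the request; server answers 202 with an id (field assumed)."""
    req = urllib.request.Request(
        f"{BASE}/v0/request",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["request_id"]

def is_done(body: dict) -> bool:
    """Terminal statuses (names assumed)."""
    return body.get("status") in {"done", "failed"}

def wait_for(request_id: str, poll_s: float = 5.0,
             timeout_s: float = 3600.0) -> dict:
    """Poll until the request reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{BASE}/v0/request/{request_id}") as resp:
            body = json.load(resp)
        if is_done(body):
            return body
        time.sleep(poll_s)
    raise TimeoutError(request_id)
```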
---
## Docker Deployment
The entire stack runs in Docker Compose: PostgreSQL 16, NATS JetStream, and the BrainiaK Core service.
```yaml
# docker-compose.yml (excerpt)
services:
  brainiak-core:
    build: ..
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      BRAINIAK_VLLM_AGENT_ALPHA_URL: http://host.docker.internal:8001
    volumes:
      - ../brainiak.md:/app/brainiak.md:ro          # personality (live reload)
      - ../session_memory.md:/app/session_memory.md:ro
      - ..:/workspace                               # full repo for git ops
      - ~/.ssh:/root/.ssh:ro                        # git push from container
      - ~/.gitconfig:/root/.gitconfig:ro
```
vLLM runs on the host directly (not in Docker) to maintain direct GPU access. The BrainiaK Core container calls it via `host.docker.internal:8001`.
**Dockerfile essentials for ARM64:**
```dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
libpq5 git openssh-client \
&& rm -rf /var/lib/apt/lists/*
```
`git` and `openssh-client` are required for the Tool Registry to execute git operations from inside the container.
---
## MathCore β The Closed-Loop Layer (Preview)
This is what differentiates BrainiaK from any other edge deployment.
Every request emits a telemetry event: node ID, latency, token counts, outcome. After 7 days of production data, a nightly pipeline activates and analyzes these distributions. It identifies behavioral regimes, detects when a node starts drifting from its historical envelope, and feeds recommendations back into the Arbiter, adjusting token budgets and routing weights for the next day.
```
Production traffic
  --> Telemetry aggregation (nightly)
  --> Statistical analysis of latency distributions
  --> Behavioral regime detection
  --> Drift alerts
  --> Recommender
  --> back into routing decisions (next day)
```
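As a toy illustration of the drift-alert stage, here is a z-score check of today's latencies against the historical envelope; MathCore's actual statistics are considerably richer than this:

```python
# Toy drift check: how many historical standard deviations today's
# mean latency has moved. A stand-in for MathCore, not its method.
import statistics

def drift_score(history, today) -> float:
    """Z-score of today's mean latency vs. the historical envelope."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(statistics.mean(today) - mu) / sigma

def recommend(history, today, threshold: float = 3.0) -> str:
    """Flag a node whose latency has left its historical envelope."""
    return "investigate" if drift_score(history, today) > threshold else "hold"
```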
The pipeline reconfigures itself based on its own execution history. No human in the loop.
The mathematical framework behind MathCore is grounded in formal theory: multi-agent general equilibrium, functional data analysis, and dynamical systems applied to agentic architectures. The first preprint is available on arXiv:
**[arXiv:2602.21255](https://arxiv.org/abs/2602.21255)**.
MathCore is the implementation of that theory running live on this hardware. Full methodology papers are forthcoming (NeurIPS/ICML target).
As of publishing: **day 1/7** of the activation window. Follow-up article incoming.
---
## What's Next
- **Qwen3.5-235B-A22B-AWQ**: fits in 128 GB at ~6-7 tok/s. Already tested.
- **MathCore activation** (day 7): closed-loop token budget optimization
- **Jetson 2 (L'Atelier)**: OCR (Qwen3-VL-8B) + embeddings, 25 GbE direct link
- **Hive**: 4-16 Jetson Thor nodes, InfiniBand HDR, physical implementation of the distributed intelligence topology from our forthcoming papers
---
## Key Takeaways
1. **128 GB unified memory changes the edge AI equation**: 122B models are now a single-device problem
2. **vLLM on ARM64 works**, with specific pitfalls around CUDA drivers and memory configuration
3. **Agentic pipelines on edge** are not just possible; they're production-ready today
4. **Async pipeline + composite memory** is the architecture that scales beyond chatbots
5. **Sovereign, on-premise, no cloud**: not a constraint, a feature
The Jetson Thor T5000 is not a bigger Orin. It's a different category of device. The software ecosystem will take 12-18 months to catch up. Build now.
---
*BrainiaK is an open R&D project. Architecture papers forthcoming (NeurIPS/ICML target). Contact: jeanphi.garnier@brainiak.tech.*


