There is a quiet irony in using an AI coding agent — running on GPU-accelerated infrastructure — to port GPU-accelerated software to an embedded GPU platform. The agent understands CUDA architectures, knows the difference between SM 8.6 and SM 8.7, recalls that nvidia-cudss-cu12 from PyPI drags in CUDA 12.9 cuBLAS libraries that conflict with the CUDA 12.6 system libraries baked into NVIDIA's L4T JetPack images, and can reason about why flash-attn wheels compiled against an older PyTorch ABI will segfault on torch 2.9.x with undefined SymInt symbols. It knows these things not because someone wrote a Jetson porting guide — no such guide exists for ACE-Step 1.5 — but because the knowledge is distributed across PyTorch issue trackers, NVIDIA forum posts, Jetson AI Lab documentation, and thousands of Dockerfiles that came before.
This blog post is about what happened when we pointed that capability at a real problem: getting ACE-Step 1.5, an open-source music foundation model that combines a Language Model planner with a Diffusion Transformer for audio synthesis, running with full GPU acceleration inside a Docker container on an NVIDIA Jetson AGX Orin 64GB.
The ingredients were simple: VS Code with GitHub Copilot powered by Claude Opus 4.6, the ACE-Step 1.5 source code cloned locally, and a Jetson AGX Orin on the network. The agent had full access to the project's source — the pyproject.toml, the gpu_config.py hardware detection module, the llm_inference.py LLM handler, the AGENTS.md contribution guidelines — and could read, search, edit, and run terminal commands against the codebase in real time. No separate documentation site was consulted. No Stack Overflow tabs were opened. The agent's knowledge of CUDA architectures, PyTorch packaging, Docker multi-stage builds, and aarch64 platform constraints came from its training, and its knowledge of this specific project came from reading the source code in the workspace. That combination — domain expertise plus live codebase access — is what made the entire porting effort possible in a single session.
ACE-Step 1.5 is a sophisticated multi-model pipeline. It loads a text-processing LLM (available in 0.6B, 1.7B, and 4B parameter variants), a Diffusion Transformer (DiT), a VAE decoder, and a text encoder. On a desktop x86 machine with a discrete NVIDIA GPU, the setup story is simple: uv sync, launch the Gradio UI, generate music. The project's pyproject.toml handles everything.
On NVIDIA Jetson, almost every assumption in that workflow is wrong.
## Python Version Lock-In

The Jetson AI Lab — NVIDIA's official repository of pre-built aarch64 wheels for PyTorch, Triton, and related libraries — only publishes wheels for CPython 3.10. ACE-Step's upstream toolchain targets Python 3.11-3.12. This is not a soft preference; it is a hard constraint imposed by binary wheel availability. There are no cp311 or cp312 PyTorch wheels for Jetson. You use 3.10 or you compile from source (a multi-hour endeavour on Orin hardware that frequently fails due to memory pressure during compilation).
ACE-Step's dependency specification includes platform markers for aarch64:
```toml
"torch==2.10.0+cu130; sys_platform == 'linux' and platform_machine == 'aarch64'"
```

These markers target the NVIDIA DGX Spark — a server-class aarch64 platform with CUDA 13.0 and a completely different GPU architecture. Running uv sync or pip install . on Jetson dutifully resolves these markers, pulls cu130 wheels, and produces a PyTorch installation that cannot execute a single CUDA kernel on Jetson's SM 8.7 (Ampere) GPU. The error message — "no kernel image is available for execution on the device" — is technically accurate but gives no indication that the root cause is an architectural mismatch between the wheel's compiled kernels and the physical GPU.
A human developer encountering this for the first time might spend hours on the wrong trail, suspecting driver issues, CUDA toolkit versions, or container runtime configuration. The coding agent identified the root cause immediately: standard PyPI cu126 wheels do not include SM 8.7 kernels, and the cu130 wheels from pyproject.toml target an entirely different GPU generation. The fix: install PyTorch from NVIDIA's Jetson AI Lab pip index (https://pypi.jetson-ai-lab.io/jp6/cu126/+simple/), which provides wheels compiled specifically for Jetson GPU architectures.
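In a Dockerfile, that fix is a single install step pointed at the alternate index. This is an illustrative fragment, not the exact layer from the final image — the unpinned package list is an assumption:

```dockerfile
# Hypothetical Dockerfile fragment: pull PyTorch from the Jetson AI Lab
# index, whose wheels ship native SM 8.7 SASS kernels. Pinning exact
# versions (e.g. torch==2.9.x) is advisable in a real build.
RUN pip3 install \
    --index-url https://pypi.jetson-ai-lab.io/jp6/cu126/+simple/ \
    torch torchvision torchaudio
```

Using --index-url (rather than --extra-index-url) matters here: it prevents pip from "helpfully" resolving torch against the standard PyPI cu126 wheels, which would reintroduce the missing-kernel problem.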
Modern Python packaging tools like uv are extraordinarily good at resolving dependency graphs — on x86_64. On aarch64, the story changes. Many packages publish source distributions but no binary wheels for aarch64. Others have binary wheels but compiled against system libraries that differ between Jetson's Ubuntu 22.04 / L4T userspace and a standard aarch64 Linux distribution. Still others — like torchcodec — explicitly exclude aarch64 in their own platform markers.
The practical consequence is that pip install . against the project's pyproject.toml will attempt to compile Fortran (for scipy), link against OpenBLAS (which must be manually installed), and resolve numpy>=1.24 against an L4T base image that ships numpy 1.21. Each of these is individually solvable, but the aggregate debugging load is substantial. The coding agent produced a Dockerfile that pre-installs libopenblas-dev, liblapack-dev, and gfortran as system packages, bootstraps a modern numpy before touching any ML libraries, and enumerates every Python dependency explicitly rather than relying on pyproject.toml resolution — because the markers in that file are actively hostile to Jetson.
NVIDIA Jetson platforms use a fundamentally different CUDA architecture from desktop and data centre GPUs. Understanding these differences is critical for porting GPU-accelerated software, and it is precisely the kind of domain knowledge that AI-assisted tooling handles well.
## SM 8.7: The Orin Architecture

The Jetson AGX Orin uses the Ampere architecture at SM (Streaming Multiprocessor) version 8.7. This is distinct from:
- SM 8.0 (A100, data centre Ampere)
- SM 8.6 (RTX 3060/3070/3080/3090, desktop Ampere)
- SM 8.9 (RTX 4090, Ada Lovelace)
- SM 9.0 (H100, Hopper)
When PyTorch is compiled for CUDA, it includes PTX or SASS (native binary) code for specific SM versions. The standard PyPI wheels for cu126 include kernels for SM 5.0 through SM 9.0a — but critically, not SM 8.7. PTX forward-compatibility exists in theory (a PTX image compiled for SM 8.0 can JIT-compile for SM 8.7 at runtime), but in practice many operations either fail outright or fall back to dramatically slower code paths. NVIDIA's Jetson AI Lab wheels solve this by compiling native SM 8.7 SASS code explicitly.
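The compatibility rule the agent applied can be written down as a small predicate. This is an illustrative sketch of the rule, not PyTorch's actual dispatch logic — the function name and classification labels are invented for this post. PyTorch does expose the compiled architecture list at runtime via torch.cuda.get_arch_list(), which is what the input below mimics:

```python
def kernel_support(arch_list, device_cc):
    """Classify how a CUDA wheel can serve a given GPU.

    arch_list: strings in the style of torch.cuda.get_arch_list(),
               e.g. ["sm_80", "sm_86", "compute_80"] — "sm_*" entries
               are native SASS binaries, "compute_*" entries are PTX.
    device_cc: device capability as (major, minor), e.g. (8, 7) for Orin.
    """
    target = device_cc[0] * 10 + device_cc[1]
    sass = {int(a.split("_")[1]) for a in arch_list if a.startswith("sm_")}
    ptx = {int(a.split("_")[1]) for a in arch_list if a.startswith("compute_")}
    if target in sass:
        return "native"       # exact SASS match: full-speed kernels
    if any(p <= target for p in ptx):
        return "ptx-jit"      # forward-compatible JIT: may be slow or flaky
    return "unsupported"      # "no kernel image is available for execution"

# A desktop-style wheel with no SM 8.7 SASS and no PTX: Orin gets nothing.
print(kernel_support(["sm_50", "sm_80", "sm_86", "sm_90"], (8, 7)))  # unsupported
# A Jetson AI Lab wheel compiled with native sm_87 SASS:
print(kernel_support(["sm_87"], (8, 7)))  # native
```

The middle case is the treacherous one: a wheel carrying only compute_80 PTX will technically load on Orin, but JIT compilation at runtime is exactly the "dramatically slower code paths" scenario described above.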
Jetson platforms use unified memory — the GPU and CPU share the same physical DRAM. On the AGX Orin 64GB, that means 64GB total, not 64GB for CPU plus additional GPU VRAM. Every byte allocated by PyTorch's CUDA allocator reduces what is available for CPU operations, model loading, and the Python runtime itself.
This architectural reality means that GPU memory management strategies designed for discrete GPUs (where 24GB VRAM and 64GB system RAM are independent pools) need re-evaluation. ACE-Step's gpu_config.py module detects available VRAM and applies tiered configuration — but its detection assumes discrete GPU memory reporting, where torch.cuda.get_device_properties(0).total_memory returns only GPU memory. On Jetson, this value reflects the unified pool, which can be misleading for the auto-offloading heuristics.
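One way to make heuristics unified-memory-aware is to compare the CUDA-reported total against system RAM: on a discrete GPU the two differ by a wide margin, while on Orin they describe the same DRAM pool. The helpers below are hypothetical — they are not part of gpu_config.py — and sketch that comparison using /proc/meminfo:

```python
import re

def parse_meminfo_total(meminfo_text):
    """Return MemTotal from /proc/meminfo contents, in bytes."""
    m = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if not m:
        raise ValueError("MemTotal not found in /proc/meminfo text")
    return int(m.group(1)) * 1024

def looks_unified(gpu_total_bytes, ram_total_bytes, tolerance=0.10):
    """Heuristic: on unified-memory SoCs like Orin, the 'GPU memory'
    CUDA reports is (approximately) the whole system RAM pool."""
    return abs(gpu_total_bytes - ram_total_bytes) / ram_total_bytes < tolerance

# On a Jetson AGX Orin 64GB, torch.cuda.get_device_properties(0).total_memory
# and MemTotal describe the same physical DRAM:
sample = "MemTotal:       64297000 kB\nMemFree:        12345678 kB\n"
ram = parse_meminfo_total(sample)
print(looks_unified(62 * 1024**3, ram))  # True — same pool, within tolerance
```

A detection like this would let auto-offloading logic reserve headroom for the CPU side instead of treating the full 64GB as free VRAM.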
PyTorch 2.9+ introduced a runtime dependency on libcudss.so.0 (CUDA Direct Sparse Solver). The library is provided by the nvidia-cudss-cu12 pip package. On x86, installing this package is invisible — it pulls compatible CUDA runtime libraries and everything works. On Jetson, it triggers a chain reaction:
- nvidia-cudss-cu12 depends on nvidia-cublas-cu12 (the CUDA 12.9 version)
- nvidia-cublas-cu12 installs cuBLAS 12.9 shared libraries
- These shadow the cuBLAS 12.6 libraries shipped in the L4T JetPack base image
- PyTorch loads the 12.9 cuBLAS at runtime
- The 12.9 cuBLAS is incompatible with the 12.6 CUDA runtime → CUBLAS_STATUS_NOT_INITIALIZED
The fix is surgical: install nvidia-cudss-cu12 for the .so, then immediately uninstall the conflicting transitive dependencies (nvidia-cublas-cu12, nvidia-cuda-runtime-cu12, nvidia-cusparse-cu12, nvidia-nvjitlink-cu12) so that PyTorch falls back to the system CUDA 12.6 libraries. This is the kind of non-obvious, platform-specific knowledge that an AI coding agent can surface in seconds because it has encountered the pattern across hundreds of similar dependency conflicts.
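Expressed as a Dockerfile step, the install-then-prune pattern looks roughly like this — a sketch of the approach, with the package list taken from the paragraph above:

```dockerfile
# Hypothetical Dockerfile fragment: install nvidia-cudss-cu12 to obtain
# libcudss.so.0, then immediately remove the CUDA 12.9 transitive
# dependencies so PyTorch resolves against the system CUDA 12.6 libraries
# that ship in the L4T JetPack base image.
RUN pip3 install nvidia-cudss-cu12 && \
    pip3 uninstall -y \
        nvidia-cublas-cu12 \
        nvidia-cuda-runtime-cu12 \
        nvidia-cusparse-cu12 \
        nvidia-nvjitlink-cu12
```

Doing both in one layer keeps the conflicting 12.9 libraries out of every subsequent image layer, so nothing built later can accidentally link against them.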
nano-vllm (the lightweight vLLM implementation bundled with ACE-Step) uses CUDA graph capture to accelerate LLM inference by recording and replaying GPU kernel sequences. On desktop GPUs, this works transparently. On Jetson, SDPA (Scaled Dot-Product Attention) paged-cache decode calls .item() — a GPU-to-CPU synchronisation point — during graph capture, which is illegal inside a CUDA graph recording context.
The agent diagnosed this by reading the traceback, understanding the CUDA graph capture restriction, and producing a targeted fix: auto-detect Jetson GPUs by checking for "orin", "xavier", or "tegra" in the CUDA device name, and set enforce_eager=True to disable graph capture. The patch to llm_inference.py was six lines:
```python
is_jetson = False
if device == "cuda" and torch.cuda.is_available():
    try:
        dev_name = torch.cuda.get_device_name(0).lower()
        is_jetson = any(k in dev_name for k in ("orin", "xavier", "tegra"))
        if is_jetson:
            logger.info(f"Jetson GPU detected ({dev_name}): disabling CUDA graph capture")
    except Exception:
        pass
enforce_eager_for_vllm = bool(is_rocm or is_jetson)
```

This is a case where the AI tooling's awareness of edge cases was decisive. The error was a runtime crash deep inside nano-vllm's CUDA graph capture path. A human developer would need to understand CUDA graph semantics, trace the .item() call through SDPA, realise it is a synchronisation violation, and find the configuration toggle — all while working on unfamiliar embedded hardware. The agent collapsed that investigation into a single iteration.
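The detection logic is simple enough to unit-test without any GPU present. A standalone version of the predicate — a sketch for testing purposes, since the actual patch lives inline in llm_inference.py — looks like this:

```python
# Device-name substrings that identify Jetson-family SoCs. Matches the
# marker set used in the llm_inference.py patch described above.
JETSON_MARKERS = ("orin", "xavier", "tegra")

def is_jetson_device(device_name):
    """Return True if a CUDA device name identifies a Jetson-family SoC."""
    name = device_name.lower()
    return any(marker in name for marker in JETSON_MARKERS)

print(is_jetson_device("Orin"))                     # True
print(is_jetson_device("NVIDIA GeForce RTX 3090"))  # False
```

Factoring the check out like this would also make it trivial to extend the marker tuple if future Jetson SoCs report different device names.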
The Jetson AI Lab publishes flash-attn wheels for aarch64, but they were compiled against an older PyTorch ABI. Attempting to load them with torch 2.9.x produces undefined symbol errors for SymInt-related C++ symbols — a breaking ABI change introduced in recent PyTorch versions. nano-vllm gracefully falls back to PyTorch's built-in SDPA attention without flash-attn, so the solution was simply to not install it. But the agent needed to verify that fallback path existed and that no codepath in ACE-Step hard-requires flash-attn before making that decision.
torchao (PyTorch's quantization and optimization library) is available for aarch64, but installing it on Jetson triggers a different failure: diffusers 0.36.0 has a bug in its torchao quantizer integration where the error handler references an undefined logger variable. When torchao is present but its UInt4Tensor module layout doesn't match what diffusers expects, the error handler itself crashes, taking down VAE model loading entirely. The agent identified this as a diffusers bug (not a torchao bug, not a Jetson bug) and disabled torchao in the Dockerfile with a clear comment explaining why — and noting that quantization features are not critical for Jetson inference.
Each of the issues above represents a distinct research task that a human developer would need to solve sequentially. The cumulative debugging time for SM 8.7 architecture mismatches, cuBLAS version conflicts, ABI-incompatible flash-attn wheels, CUDA graph capture failures on SDPA, and torchao/diffusers interaction bugs could easily span days or weeks of investigation.
The AI coding agent resolved all of these in a single collaborative session. The workflow was iterative:
- Prompt: "Create a Dockerfile that builds ACE-Step 1.5 on NVIDIA Jetson with GPU acceleration"
- Build → fail → diagnose → fix: eight build iterations, each surfacing a new platform-specific issue
- Runtime test: GPU smoke test (CUDA availability, SM architecture verification, matrix multiplication)
- End-to-end test: Generate a 5-second audio clip — 6,541 MB peak GPU memory, 3.3 seconds generation time, valid WAV output
- Optimisation: Switch from PyTorch backend to vllm + 4B LM model for best quality (per upstream README recommendation for ≥24GB GPUs)
The agent did not hallucinate solutions. It proposed a fix, observed the build or runtime failure, read the error, and adjusted. When it suggested installing flash-attn, the ABI crash taught it to exclude it. When it set the default backend to pt, the user correctly pointed out that the README recommends vllm + 4B for ≥24GB systems — and the agent updated the Dockerfile, docker-compose, and entrypoint accordingly.
A critical insight from this project is that AI coding agents are not limited to writing source code. They operate within an environment — VS Code, with its integrated terminal, Docker extension, and file system access — and can use every tool available in that environment.
## Building and Iterating on Docker Images

The agent built Docker images directly from the VS Code terminal:

```shell
docker build -f Dockerfile.jetson -t acestep-jetson .
```

When builds failed, the agent read the terminal output, identified the failing layer, and edited the Dockerfile — all within the same VS Code session. There was no context switch between "write code" and "test code." The agent treated the Docker build as a compilation step with error output, applying the same diagnose-and-fix loop it would use for a Python syntax error.
## Running Containers and Inspecting Runtime Behaviour

Once the image built successfully, the agent launched containers with GPU access:

```shell
docker run --runtime nvidia -it --rm \
  -p 7860:7860 \
  -v $(pwd)/checkpoints:/app/checkpoints \
  acestep-jetson
```

It then used docker exec to run diagnostic commands inside the running container:
```shell
docker exec acestep-jetson python -c "
import torch
print(f'CUDA: {torch.cuda.is_available()}')
print(f'Device: {torch.cuda.get_device_name(0)}')
print(f'SM: {torch.cuda.get_device_capability()}')
print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')
"
```

## Copying Test Scripts Into Running Containers

When benchmarking vllm vs PyTorch backends, the agent created a test script locally and copied it into the running container:
```shell
docker cp /tmp/test_pt_vs_vllm.py acestep-jetson:/tmp/test_pt_vs_vllm.py
docker exec acestep-jetson python /tmp/test_pt_vs_vllm.py
```

This is not "code generation" — it is using the system's Docker toolchain as an extension of the development workflow. The agent understands docker cp, docker exec, docker logs, and docker rm as first-class tools, not just as commands to suggest to the user.
The agent produced a complete docker-compose.jetson.yml that encoded the entire runtime configuration:
```yaml
services:
  acestep:
    build:
      context: .
      dockerfile: Dockerfile.jetson
    runtime: nvidia
    environment:
      - ACESTEP_LLM_BACKEND=vllm
      - ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-4B
      - ACESTEP_INIT_SERVICE=true
    volumes:
      - ./checkpoints:/app/checkpoints
      - ./gradio_outputs:/app/gradio_outputs
    ports:
      - "7860:7860"
    shm_size: "2gb"
    healthcheck:
      test: curl -sf http://localhost:7860/ || exit 1
      start_period: 300s
```

This compose file is itself a testable artefact. docker compose up is the test. The health check verifies the Gradio server started. The bind mount for gradio_outputs lets a human (or a CI system) verify that generated audio files appear on the host filesystem. The agent authored all of this as part of the same flow that produced the Dockerfile and the runtime configuration patches.
VS Code's Docker extension provides a GUI for managing containers, images, and volumes. With the container running, you can:
- Attach a shell to the running container directly from the VS Code sidebar
- View logs in real time without switching to a terminal
- Browse the container's filesystem to inspect generated output files
- Right-click → Inspect to examine container configuration, environment variables, and mounted volumes
- Use the Dev Containers extension to open the container as a full VS Code remote development environment, with IntelliSense, debugging, and terminal access inside the container
The AI agent's terminal-based Docker commands and VS Code's visual Docker tooling are complementary. The agent handles the iterative build-test-fix cycle at speed; the developer uses the GUI to browse results, inspect state, and validate output quality (in this case, listening to generated music files).
## Rapid Test Generation

Beyond infrastructure, the agent generated test scripts tailored to the specific platform constraints. When we needed to verify that the Jetson GPU was correctly detected and enforce_eager was being set, the agent didn't just suggest "check the logs" — it produced a targeted Python script that:

- Imported torch and verified CUDA availability
- Queried torch.cuda.get_device_name(0) and checked for Jetson identifiers
- Ran a matrix multiplication to confirm SM 8.7 kernel execution
- Timed the operation to establish a baseline
When we needed to compare vllm vs pt backend performance, the agent wrote a benchmark script that initialised both backends sequentially, ran timed inference passes, and reported comparative results — revealing that while vllm was ~60% slower on raw Phase 1 inference (30.5s vs 19.1s), the 4B model's quality advantage justified the tradeoff on a 64GB system.
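The benchmark harness needs nothing exotic — a wall-clock timer around repeated calls, with warmup so lazy initialisation doesn't pollute the numbers. A minimal, backend-agnostic sketch (benchmark and run_backend are illustrative names, not functions from the agent's actual script):

```python
import time

def benchmark(fn, repeats=3, warmup=1):
    """Return the mean wall-clock seconds of fn() over `repeats` runs,
    after `warmup` untimed calls. For GPU workloads, fn should end with
    torch.cuda.synchronize() so the timer measures completed kernels,
    not just enqueued ones."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Usage sketch — run_backend("pt") / run_backend("vllm") stand in for the
# real Phase 1 inference calls:
#   pt_s   = benchmark(lambda: run_backend("pt"))
#   vllm_s = benchmark(lambda: run_backend("vllm"))
#   print(f"vllm/pt time ratio: {vllm_s / pt_s:.2f}")
```

The synchronisation caveat in the docstring is the classic pitfall when timing CUDA code: without it, perf_counter measures only how fast kernels were queued.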
The Jetson auto-detection patch to llm_inference.py is a clean, upstreamable change. It detects Jetson GPUs by device name and disables CUDA graph capture — a fix that benefits all Jetson users, not just Docker deployments. The agent produced the patch with proper code style (matching the existing codebase conventions), added a log message using the project's loguru logger (not print()), and scoped the change to exactly the lines that needed modification.
The Dockerfile and docker-compose file are new files that don't modify any existing code. They can be contributed as a standalone Jetson support PR with zero risk to existing platforms — exactly the kind of minimal-scope, low-risk contribution that open-source maintainers welcome.
And that is exactly what we did. The agent forked the repository, created a feature branch, committed the four changed files, pushed to the fork, and opened a pull request — all from the VS Code terminal, using gh (the GitHub CLI) as another tool in its arsenal. The resulting PR is live and available for review:
PR #735: feat: Add NVIDIA Jetson Docker support with GPU acceleration
The PR contains the Dockerfile.jetson, docker-compose.jetson.yml, .dockerignore, and the six-line llm_inference.py patch for Jetson GPU auto-detection. The blogpost.md you are reading now was deliberately excluded — it is not part of the codebase. Readers can inspect the exact diff, review the Dockerfile layer by layer, and reproduce the build on their own Jetson hardware.
The agent's awareness of the project's contribution guidelines (documented in AGENTS.md and CONTRIBUTING.md) meant that the patch naturally adhered to the "solve one problem per task/PR" and "do not alter non-target hardware/runtime paths" policies. It didn't refactor gpu_config.py while it was in there. It didn't "clean up" unrelated code. It made the smallest viable change and stopped.
Porting ACE-Step 1.5 to Jetson is not just an exercise in cross-compilation — it opens the door to genuinely new workflows on systems equipped with OpenClaw hardware and NVIDIA Jetson accelerators. With the full ACE-Step pipeline running locally — DiT, VAE, text encoder, and a 4B parameter LLM — the Jetson becomes a self-contained music production engine that requires no cloud connectivity, no API keys, and no per-generation costs.
## LLM-Orchestrated Concept Albums

Consider the possibilities when you pair a locally running LLM (for planning and lyric generation) with ACE-Step's music generation models, all on the same device. An LLM can be prompted to:
- Research a corpus: Ingest and analyse the complete works of a particular author — poems, short stories, essays — and extract themes, imagery, recurring motifs, and emotional arcs
- Design an album structure: Propose a track listing with titles, tempos, keys, and genre descriptors that map the source material's narrative arc into a musical journey
- Write lyrics: Generate lyrics for each track that draw directly from the source material's language and imagery, adapted to song structure with verses, choruses, and bridges
- Compose captions: Produce the detailed musical captions that ACE-Step's LM planner uses to guide the DiT — specifying instrumentation, production style, tempo, key signature, and mood
- Execute generation: Call ACE-Step's API to generate each track, iterating on parameters until the output matches the creative vision
Imagine asking an LLM to produce a plan that executes against the locally running ACE-Step models to fully research the works of Edgar Allan Poe — The Raven, The Fall of the House of Usher, The Masque of the Red Death, Annabel Lee, The Tell-Tale Heart — and generate a concept album in the spirit of The Alan Parsons Project's Tales of Mystery and Imagination (1976). That landmark album translated Poe's gothic atmosphere into progressive rock through tracks like "A Dream Within a Dream," "The Raven," and "(The System of) Doctor Tarr and Professor Fether."
With ACE-Step running locally on Jetson, the LLM could:
- Analyse Poe's metrical patterns and propose matching time signatures
- Map the escalating paranoia of The Tell-Tale Heart to a track that builds from whispered acoustic passages to distorted crescendo
- Translate the melancholy of Annabel Lee into a ballad caption specifying "minor key, solo piano intro, orchestral strings, female vocal, 72 BPM, 4/4"
- Generate multiple variations of each track and select the most cohesive set
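The planning half of that loop is just structured data flowing from the LLM into caption strings. A toy sketch of that hand-off — TrackSpec and render_caption are hypothetical names invented here, and the caption format is illustrative rather than ACE-Step's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TrackSpec:
    """One planned track — fields mirror what an LLM planner might emit."""
    title: str
    mood: str
    key: str
    bpm: int
    meter: str
    instrumentation: str

def render_caption(t: TrackSpec) -> str:
    """Flatten a track plan into a comma-separated musical caption of the
    kind a DiT-guiding planner consumes (format is an assumption)."""
    return f"{t.mood}, {t.key}, {t.instrumentation}, {t.bpm} BPM, {t.meter}"

annabel_lee = TrackSpec(
    title="Annabel Lee",
    mood="melancholy ballad",
    key="minor key",
    bpm=72,
    meter="4/4",
    instrumentation="solo piano intro, orchestral strings, female vocal",
)
print(render_caption(annabel_lee))
# melancholy ballad, minor key, solo piano intro, orchestral strings, female vocal, 72 BPM, 4/4
```

An orchestrating LLM would emit a list of such specs for the whole album, and a small driver loop would feed each rendered caption to the locally running generation pipeline.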
This is not speculative — every component exists today. The LLM plans, ACE-Step generates, and the Jetson provides the compute. The missing piece was getting these models running efficiently on Jetson hardware, which is exactly what this porting effort accomplished.
## Beyond Music: The Embedded AI Creation Platform

The pattern generalises. Any creative AI workflow that combines planning (LLM) with generation (diffusion model) can run entirely on Jetson hardware. Music is a compelling first case because ACE-Step's architecture — an LM planner feeding structured prompts to a DiT — maps cleanly onto the "research → plan → execute" pattern that LLMs excel at. But the same Jetson container could host image generation, video synthesis, or multimodal pipelines as those models achieve similar efficiency.
## The Full Circle

We used a GPU-accelerated coding agent to build a GPU-accelerated Docker container for a GPU-accelerated music generation model, targeting an embedded GPU platform with non-standard CUDA architecture. The agent navigated SM 8.7 kernel compilation requirements, CUDA 12.6/12.9 library conflicts, PyTorch ABI incompatibilities, CUDA graph capture restrictions, and Python version constraints — producing a working, production-ready container in a single collaborative session.
The resulting system generates commercial-grade music on an NVIDIA Jetson AGX Orin using less than 7GB of peak GPU memory for inference. It runs the highest-quality configuration (vllm backend + 4B LM model) and auto-initialises models on container startup. The Dockerfile is self-documenting, the docker-compose file is ready for deployment, and the Jetson auto-detection patch is upstreamable.
Every edge case we hit — every library conflict, every architecture mismatch, every runtime crash — was the kind of problem that traditionally makes embedded GPU development a specialised skill requiring deep platform knowledge. The AI coding agent didn't eliminate the need for that knowledge. It had that knowledge, and applied it at the speed of iteration rather than the speed of research.
That is the thesis: GPU-powered coding agents don't just write code faster. They compress the entire research-diagnose-fix cycle for GPU-accelerated software development from days into minutes. When the target platform is as nuanced as NVIDIA Jetson — where every assumption from the x86 world needs re-examination — that compression is the difference between "someone should port this" and "it's running."



