Deploying modern object detection models on edge devices with limited resources (like the 4GB Jetson Orin Nano) is always a trade-off between accuracy and speed. While Python frameworks (like ultralytics) are great for training, they often introduce significant overhead during inference.
In this project, I explore the limits of the Jetson Orin Nano by ditching Python for a pure C++ implementation using NVIDIA TensorRT. My goal was twofold:
- Achieve real-time performance (>30 FPS) for YOLO models.
- Benchmark the newly released YOLOv26 against the stable YOLOv8 to see if the "End-to-End" architecture holds up in a strict TensorRT environment.
My initial tests with Python on the Jetson Orin Nano (4GB) hit a bottleneck: memory usage was high, and the Python Global Interpreter Lock (GIL), combined with framework overhead, made it difficult to maintain a stable, high framerate. To solve this, I built a custom C++ inference pipeline that handles the following stages (the preprocessing stage is sketched right after the list):
- Media I/O: OpenCV (with hardware acceleration where possible).
- Preprocessing: CUDA kernels (Resize, Normalize, CHW conversion).
- Inference: TensorRT Engine (FP16 precision).
- Post-processing: C++ implementation of NMS and coordinate mapping.
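For concreteness, here is a minimal sketch of the kind of fused preprocessing kernel this stage uses. It is illustrative rather than the exact code from this repo: it assumes nearest-neighbour resizing, packed BGR input, and plain /255 normalization, while the production kernel may add letterboxing and bilinear sampling.

```cpp
// Illustrative preprocessing kernel (assumptions: nearest-neighbour resize,
// packed BGR uint8 input, /255 normalization). Compile with nvcc.
#include <cuda_runtime.h>
#include <cstdint>

// Resize + BGR->RGB + normalize + HWC->CHW in a single pass.
__global__ void preprocessKernel(const uint8_t* src, int srcW, int srcH,
                                 float* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the destination pixel back to the source image (nearest neighbour).
    int sx = min(static_cast<int>(x * (float)srcW / dstW), srcW - 1);
    int sy = min(static_cast<int>(y * (float)srcH / dstH), srcH - 1);
    const uint8_t* p = src + (sy * srcW + sx) * 3;   // packed HWC, BGR order

    int plane = dstW * dstH;                         // one CHW channel
    dst[0 * plane + y * dstW + x] = p[2] / 255.0f;   // R
    dst[1 * plane + y * dstW + x] = p[1] / 255.0f;   // G
    dst[2 * plane + y * dstW + x] = p[0] / 255.0f;   // B
}

// Host launcher: src/dst are device buffers; dst feeds the TensorRT input binding.
void preprocess(const uint8_t* srcDev, int srcW, int srcH,
                float* dstDev, int dstW, int dstH, cudaStream_t stream)
{
    dim3 block(16, 16);
    dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
    preprocessKernel<<<grid, block, 0, stream>>>(srcDev, srcW, srcH, dstDev, dstW, dstH);
}
```

Fusing the resize, colour conversion, normalization, and HWC-to-CHW transpose into one kernel means each frame is read only once on the GPU, instead of making several passes through OpenCV on the CPU.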
One of the core experiments of this project was attempting to deploy the experimental YOLOv26. However, I encountered a significant hurdle: Confidence Misalignment.
While YOLOv8 converted to TensorRT perfectly, YOLOv26 exhibited bounding box drift and inaccurate confidence scores in C++. To understand why, I analyzed the ONNX graphs of both models (compared with YOLOv10 for reference).
Model Architecture Discrepancy
- YOLOv10 / v8 (Optimized): The ONNX export includes the complete post-processing subgraph (TopK and Gather operators). The output shape is typically 1x300x6, allowing for true End-to-End NMS-free inference.
- YOLOv26 (Default Export): The exported v26 model outputs 1x84x8400. It lacks the embedded end-to-end post-processing subgraph.
Conclusion: The "NMS-Free" feature advertised for v26 relies on specific Python-side handling or specific export arguments that are not yet standard. In a raw TensorRT C++ environment, this fallback to the traditional output format causes compatibility issues with standard post-processing pipelines.
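To make the difference concrete, the sketch below contrasts the two decode paths, assuming the commonly used layouts: channel-major cx/cy/w/h plus the class scores for the raw 1x84x8400 head, and rows of x1/y1/x2/y2/score/class for the 1x300x6 end-to-end output. It is a simplified illustration, not the exact code shipped with this project.

```cpp
#include <vector>

struct Det { float x1, y1, x2, y2, score; int cls; };

// End-to-end export (1x300x6): each row is already a final detection.
// Assumed row layout: [x1, y1, x2, y2, score, class]; only thresholding remains.
std::vector<Det> decodeEndToEnd(const float* out, int numDet, float confThr)
{
    std::vector<Det> dets;
    for (int i = 0; i < numDet; ++i) {
        const float* r = out + i * 6;
        if (r[4] < confThr) continue;
        dets.push_back({r[0], r[1], r[2], r[3], r[4], static_cast<int>(r[5])});
    }
    return dets;
}

// Raw head (1x84x8400), channel-major: out[c * numAnchors + a] holds cx, cy, w, h
// followed by the class scores. This path still needs NMS afterwards.
std::vector<Det> decodeRaw(const float* out, int numAnchors, int numClasses, float confThr)
{
    std::vector<Det> dets;
    for (int a = 0; a < numAnchors; ++a) {
        int bestCls = 0;
        float bestScore = 0.f;
        for (int c = 0; c < numClasses; ++c) {
            float s = out[(4 + c) * numAnchors + a];
            if (s > bestScore) { bestScore = s; bestCls = c; }
        }
        if (bestScore < confThr) continue;
        float cx = out[0 * numAnchors + a], cy = out[1 * numAnchors + a];
        float w  = out[2 * numAnchors + a], h  = out[3 * numAnchors + a];
        dets.push_back({cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, bestScore, bestCls});
    }
    return dets; // feed into class-aware NMS and coordinate mapping afterwards
}
```

With the end-to-end output, the C++ side stays trivial. That is exactly why the missing subgraph in the default v26 export matters: the raw head forces the heavier decode-plus-NMS path and makes the pipeline sensitive to any layout mismatch.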
Note: For the stability of this project's code release, I have set YOLOv8n as the default model, as it provides the most stable industrial-grade performance.
Performance Benchmarks
I tested the inference pipeline across three different configurations. The results clearly show the superiority of the C++ TensorRT approach on edge hardware.
1. Mac Mini (M-Series Chip)
- CPU Inference: ~21.4 FPS
- MPS (GPU) Inference: ~20.5 FPS
- Insight: On macOS, the MPS backend reached higher instantaneous FPS but suffered from synchronization latency, resulting in a lower average FPS than the CPU for video streams.
2. Jetson Orin Nano (Python)
- ONNX Runtime: ~16.0 FPS
- The overhead of the Python runtime and ONNX interpretation limits performance.
3. Jetson Orin Nano (C++ TensorRT, FP16)
- Video Inference (No Display): 33.2 FPS
- Latency: ~12 ms (End-to-End)
- Throughput: ~90 FPS (raw benchmark with trtexec)
By switching to C++ and TensorRT, I achieved a ~100% performance boost compared to the Python implementation on the same hardware, making it viable for real-time robotic applications.
How to Run the Code
Prerequisites
- NVIDIA Jetson (Orin Nano/NX/AGX)
- JetPack 6.x (CUDA, TensorRT installed)
- OpenCV (with GStreamer support recommended)
Step 1: Export the Model to ONNX
You can use my script to export the YOLOv8 model to ONNX. Note that we use opset=18 for maximum compatibility.
```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.export(format="onnx", imgsz=640, dynamic=False, simplify=True, opset=18)
```

Step 2: Build the Engine
Use trtexec to convert the ONNX model to a highly optimized TensorRT engine (FP16 precision is recommended for Orin Nano).
```bash
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8n.onnx \
  --saveEngine=yolov8n_fp16.engine \
  --fp16
```

Step 3: Compile and Run
Navigate to the C++ project directory and build using CMake.
```bash
mkdir build && cd build
cmake ..
make -j4
./yolo_app
```
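For reference, here is a stripped-down sketch of what an application like yolo_app does at startup to consume the engine built in Step 2: read the serialized file and deserialize it with the TensorRT C++ runtime. This is a minimal stand-in, not the full application, which additionally allocates CUDA buffers, runs the preprocessing kernel, and decodes the output.

```cpp
// Minimal engine-loading sketch (TensorRT C++ API, JetPack 6.x).
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main()
{
    // Read the serialized engine produced by trtexec in Step 2.
    std::ifstream file("yolov8n_fp16.engine", std::ios::binary);
    if (!file) { std::cerr << "Engine file not found" << std::endl; return 1; }
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    Logger logger;
    std::unique_ptr<nvinfer1::IRuntime> runtime(nvinfer1::createInferRuntime(logger));
    std::unique_ptr<nvinfer1::ICudaEngine> engine(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    std::unique_ptr<nvinfer1::IExecutionContext> context(engine->createExecutionContext());

    std::cout << "Engine loaded, ready for inference" << std::endl;
    // From here the real app allocates device buffers, launches the CUDA
    // preprocessing kernel, enqueues inference on a stream, and decodes the output.
    return 0;
}
```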
Future Work
While the current system runs YOLOv8 flawlessly, solving the YOLOv26 export issue is next on the roadmap. I plan to:
- Investigate custom ONNX export scripts to force the inclusion of TopK layers for v26.
- Integrate this perception module into a ROS 2 node for my RoboCup Rescue simulation project.
If you are interested in the bleeding edge of Embedded AI, feel free to fork the repo and contribute!







