Deploying modern object detection models on edge devices with limited resources (like the 4GB Jetson Orin Nano) is always a trade-off between accuracy and speed. While Python frameworks (like ultralytics) are great for training, they often introduce significant overhead during inference.
In this project, I explore the limits of the Jetson Orin Nano by ditching Python for a pure C++ implementation using NVIDIA TensorRT. My goal was twofold:
- Achieve real-time performance (>30 FPS) for YOLO models.
- Benchmark the newly released YOLOv26 against the stable YOLOv8 to see if the "End-to-End" architecture holds up in a strict TensorRT environment.
My initial tests with Python on the Jetson Orin Nano (4GB) hit a bottleneck: memory usage was high, and the Python Global Interpreter Lock (GIL), combined with framework overhead, made it difficult to maintain a stable, high framerate. To solve this, I built a custom C++ inference pipeline that handles the following stages (the preprocessing stage is sketched right after the list):
- Media I/O: OpenCV (with hardware acceleration where possible).
- Preprocessing: CUDA kernels (Resize, Normalize, CHW conversion).
- Inference: TensorRT Engine (FP16 precision).
- Post-processing: C++ implementation of NMS and coordinate mapping.
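For concreteness, here is a minimal sketch of the kind of fused preprocessing kernel this stage uses. It is illustrative rather than the exact code from this repo: it assumes nearest-neighbour resizing, packed BGR input, and plain /255 normalization, while the production kernel may add letterboxing and bilinear sampling.

```cpp
// Illustrative preprocessing kernel (assumptions: nearest-neighbour resize,
// packed BGR uint8 input, /255 normalization). Compile with nvcc.
#include <cuda_runtime.h>
#include <cstdint>

// Resize + BGR->RGB + normalize + HWC->CHW in a single pass.
__global__ void preprocessKernel(const uint8_t* src, int srcW, int srcH,
                                 float* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the destination pixel back to the source image (nearest neighbour).
    int sx = min(static_cast<int>(x * (float)srcW / dstW), srcW - 1);
    int sy = min(static_cast<int>(y * (float)srcH / dstH), srcH - 1);
    const uint8_t* p = src + (sy * srcW + sx) * 3;   // packed HWC, BGR order

    int plane = dstW * dstH;                         // one CHW channel
    dst[0 * plane + y * dstW + x] = p[2] / 255.0f;   // R
    dst[1 * plane + y * dstW + x] = p[1] / 255.0f;   // G
    dst[2 * plane + y * dstW + x] = p[0] / 255.0f;   // B
}

// Host launcher: src/dst are device buffers; dst feeds the TensorRT input binding.
void preprocess(const uint8_t* srcDev, int srcW, int srcH,
                float* dstDev, int dstW, int dstH, cudaStream_t stream)
{
    dim3 block(16, 16);
    dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
    preprocessKernel<<<grid, block, 0, stream>>>(srcDev, srcW, srcH, dstDev, dstW, dstH);
}
```

Fusing the resize, colour conversion, normalization, and HWC-to-CHW transpose into one kernel means each frame is read only once on the GPU, instead of making several passes through OpenCV on the CPU.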
One of the core experiments of this project was attempting to deploy the experimental YOLOv26. However, I encountered a significant hurdle: Confidence Misalignment.
While YOLOv8 converted to TensorRT perfectly, YOLOv26 exhibited bounding box drift and inaccurate confidence scores in C++. To understand why, I analyzed the ONNX graphs of both models (compared with YOLOv10 for reference).
Model Architecture Discrepancy
- YOLOv10 / v8 (Optimized): The ONNX export includes the complete post-processing subgraph (TopK and Gather operators). The output shape is typically 1x300x6, allowing for true End-to-End NMS-free inference.
- YOLOv26 (Default Export): The exported v26 model outputs 1x84x8400. It lacks the embedded end-to-end post-processing subgraph.
Conclusion: The "NMS-Free" feature advertised for v26 relies on specific Python-side handling or specific export arguments that are not yet standard. In a raw TensorRT C++ environment, this fallback to the traditional output format causes compatibility issues with standard post-processing pipelines.
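To make the difference concrete, the sketch below contrasts the two decode paths, assuming the commonly used layouts: channel-major cx/cy/w/h plus the class scores for the raw 1x84x8400 head, and rows of x1/y1/x2/y2/score/class for the 1x300x6 end-to-end output. It is a simplified illustration, not the exact code shipped with this project.

```cpp
#include <vector>

struct Det { float x1, y1, x2, y2, score; int cls; };

// End-to-end export (1x300x6): each row is already a final detection.
// Assumed row layout: [x1, y1, x2, y2, score, class]; only thresholding remains.
std::vector<Det> decodeEndToEnd(const float* out, int numDet, float confThr)
{
    std::vector<Det> dets;
    for (int i = 0; i < numDet; ++i) {
        const float* r = out + i * 6;
        if (r[4] < confThr) continue;
        dets.push_back({r[0], r[1], r[2], r[3], r[4], static_cast<int>(r[5])});
    }
    return dets;
}

// Raw head (1x84x8400), channel-major: out[c * numAnchors + a] holds cx, cy, w, h
// followed by the class scores. This path still needs NMS afterwards.
std::vector<Det> decodeRaw(const float* out, int numAnchors, int numClasses, float confThr)
{
    std::vector<Det> dets;
    for (int a = 0; a < numAnchors; ++a) {
        int bestCls = 0;
        float bestScore = 0.f;
        for (int c = 0; c < numClasses; ++c) {
            float s = out[(4 + c) * numAnchors + a];
            if (s > bestScore) { bestScore = s; bestCls = c; }
        }
        if (bestScore < confThr) continue;
        float cx = out[0 * numAnchors + a], cy = out[1 * numAnchors + a];
        float w  = out[2 * numAnchors + a], h  = out[3 * numAnchors + a];
        dets.push_back({cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, bestScore, bestCls});
    }
    return dets; // feed into class-aware NMS and coordinate mapping afterwards
}
```

With the end-to-end output, the C++ side stays trivial. That is exactly why the missing subgraph in the default v26 export matters: the raw head forces the heavier decode-plus-NMS path and makes the pipeline sensitive to any layout mismatch.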
Note: For the stability of this project's code release, I have set YOLOv8n as the default model, as it provides the most stable industrial-grade performance.
Performance Benchmarks
I tested the inference pipeline across three different configurations. The results clearly show the superiority of the C++ TensorRT approach on edge hardware.
1. Mac Mini (M-Series Chip)
- CPU Inference: ~21.4 FPS
- MPS (GPU) Inference: ~20.5 FPS
- Insight: On macOS, the MPS backend reached higher instantaneous FPS but suffered from synchronization latency, resulting in a lower average FPS than the CPU for video streams.
2. Jetson Orin Nano (Python)
- ONNX Runtime: ~16.0 FPS
- The overhead of the Python runtime and ONNX interpretation limits performance.
3. Jetson Orin Nano (C++ TensorRT, FP16)
- Video Inference (No Display): 33.2 FPS
- Latency: ~12 ms (End-to-End)
- Throughput: ~90 FPS (raw benchmark with trtexec)
By switching to C++ and TensorRT, I achieved a ~100% performance boost compared to the Python implementation on the same hardware, making it viable for real-time robotic applications.
How to Run the Code
Prerequisites
- NVIDIA Jetson (Orin Nano/NX/AGX)
- JetPack 6.x (CUDA, TensorRT installed)
- OpenCV (with GStreamer support recommended)
Step 1: Export the Model to ONNX
You can use my script to export the YOLOv8 model to ONNX. Note that we use opset=18 for maximum compatibility.
```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.export(format="onnx", imgsz=640, dynamic=False, simplify=True, opset=18)
```

Step 2: Build the Engine
Use trtexec to convert the ONNX model to a highly optimized TensorRT engine (FP16 precision is recommended for Orin Nano).
```bash
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8n.onnx \
  --saveEngine=yolov8n_fp16.engine \
  --fp16
```

Step 3: Compile and Run
Navigate to the C++ project directory and build using CMake.
```bash
mkdir build && cd build
cmake ..
make -j4
./yolo_app
```
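For reference, here is a stripped-down sketch of what an application like yolo_app does at startup to consume the engine built in Step 2: read the serialized file and deserialize it with the TensorRT C++ runtime. This is a minimal stand-in, not the full application, which additionally allocates CUDA buffers, runs the preprocessing kernel, and decodes the output.

```cpp
// Minimal engine-loading sketch (TensorRT C++ API, JetPack 6.x).
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main()
{
    // Read the serialized engine produced by trtexec in Step 2.
    std::ifstream file("yolov8n_fp16.engine", std::ios::binary);
    if (!file) { std::cerr << "Engine file not found" << std::endl; return 1; }
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    Logger logger;
    std::unique_ptr<nvinfer1::IRuntime> runtime(nvinfer1::createInferRuntime(logger));
    std::unique_ptr<nvinfer1::ICudaEngine> engine(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    std::unique_ptr<nvinfer1::IExecutionContext> context(engine->createExecutionContext());

    std::cout << "Engine loaded, ready for inference" << std::endl;
    // From here the real app allocates device buffers, launches the CUDA
    // preprocessing kernel, enqueues inference on a stream, and decodes the output.
    return 0;
}
```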
Future Work
While the current system runs YOLOv8 flawlessly, solving the YOLOv26 export issue is next on the roadmap. I plan to:
- Investigate custom ONNX export scripts to force the inclusion of TopK layers for v26.
- Integrate this perception module into a ROS 2 node for my RoboCup Rescue simulation project.
If you are interested in the bleeding edge of Embedded AI, feel free to fork the repo and contribute!







