If you have tried building any kind of vision system on the edge, say, agricultural monitoring, inventory tracking, or anything that needs to identify objects locally without pinging the cloud, you might have hit the same wall: traditional MCUs simply don’t have the memory to run object detection. Not enough RAM to buffer a camera frame and not enough ROM to store model weights. The usual fix is reaching for a Linux SBC, which immediately blows your power budget and inflates your BOM.
The STM32N6570-DK was designed to make that tradeoff unnecessary. ST took an unconventional approach to this board’s design: the chip has no internal flash but instead pairs an external Octo-SPI flash with 4.2MB of internal SRAM and a dedicated Neural-ART Accelerator sitting alongside the Cortex-M55 core. The result is an MCU with enough fast memory to hold camera frame buffers, model weights, and NPU activations simultaneously. Add Edge Impulse to the mix to handle the MLOps pipeline, and what used to be weeks of C++ toolchain suffering becomes a workflow you can actually complete in a weekend. This article documents exactly how that works, including the not-so-obvious parts.
The Zero Internal Flash Advantage

For computer vision tasks, what makes the board viable is the clean split between storage and runtime memory: slow external flash for persistence, fast internal SRAM for execution. Frame buffers, model weights, NPU activations, and system logic can all live in SRAM simultaneously without any of them getting crowded out. The boot process specifically operates in the AXI SRAM2 region, which the linker script carves into a 255KB code section and a 256KB data section, a split with direct implications for the binary size limits covered in the next section.
The no-flash design also means the chip needs a structured boot sequence to get everything into SRAM before inference can start. The Boot ROM, the First-Stage Boot Loader (FSBL), and the load-and-run process pull binaries from external flash into internal memory at startup, so the model ultimately runs from fast SRAM rather than executing off the slower flash. On the hardware side, the rest of the board completes the vision-specific workflow: the ST Neural-ART Accelerator handles inference offload so the Cortex-M55 isn’t buried in matrix math, while an H264 encoder, NeoChrom 2.5D GPU, and onboard camera and microphone interfaces handle the I/O side without external chips. For the STM32N6570-DK, a development kit, the gap between a working demo and a shippable design is unusually small.

STM32N6’s Start Sequence: Boot ROM to FSBL to Application
Because there’s no internal flash, the STM32N6 can’t just power on and start executing your application from address zero like a conventional MCU. Every boot begins with the on-chip Boot ROM, the one piece of code that lives in actual ROM and runs before anything else. It initializes the system, detects the reset source, then reaches out to the external Octo-SPI flash, pulls the First Stage Bootloader (FSBL) into AXI SRAM2, and authenticates it with ECDSA 256/384 signature verification before jumping to it. That last part is non-negotiable: the FSBL must be signed, or the Boot ROM will refuse to run it when the device is in its secured locked state. From there, the FSBL takes over, configuring system clocks, initializing the XSPI2 external memory interface, and copying the main application binary from external flash into internal SRAM via a Load and Run (LRUN) sequence. The full chain looks like this:
Boot ROM ⟶ AXI SRAM2 load ⟶ FSBL (clock + XSPI2 init) ⟶ application LRUN ⟶ sensor buffer fill ⟶ Neural-ART inference
Edge Impulse’s STM32N6 deployment uses the load-and-run template described above, which is suitable for many vision projects. If your project needs to go beyond the 511KB application ceiling, consider the Execute in Place (XIP) template described in this guide: Link.
The Software Bridge: Edge Impulse + ST Neural-ART Relocatable Mode

Getting a model onto the STM32N6 without Edge Impulse means manually configuring the ST Neural-ART toolchain, handling memory layout for the NPU, and wrestling with the C++ compilation pipeline yourself, which is where most projects stall before inference ever runs. Edge Impulse handles all of that, targeting the Neural-ART Accelerator directly and outputting binaries that slot into the STM32N6’s boot process without manual intervention. The platform supports FOMO and YOLOv5 for this board, but we opted for YOLO-Pro, specifically the Pico and Nano variants. YOLO-Pro’s full spatial bounding boxes would typically crush a conventional MCU, but on the STM32N6, backed by 4.2MB of contiguous SRAM and the Neural-ART Accelerator, the board absorbs the heavier architecture without issue.

The more interesting feature is how Edge Impulse handles model deployment through ST Neural-ART’s relocatable mode. Instead of baking the neural network weights directly into the application firmware, the build process splits them out into a separate binary, network_data.hex, that lives at its own flash address independently of the main application. In practice this means you can update your model weights without recompiling or reflashing the firmware: swap one model for another by reflashing a single file at a single address, as sketched below, and the rest of the system stays untouched. For anyone thinking beyond the demo toward actual field deployment, that separation is worth understanding before you start training.
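To make that concrete, here is a minimal sketch of what a later model swap can look like from the command line, assuming STM32CubeProgrammer’s CLI (STM32_Programmer_CLI) is installed and the board is already in the DEV-mode state covered in the deployment section; the external-loader filename is an assumption to verify against your own install.

```
# Hypothetical model swap: rewrite only the relocatable weights file.
# network_data.hex carries its own target address, so the FSBL and
# application firmware in external flash are left untouched.
# Loader name is an assumption -- check the ExternalLoader folder
# of your STM32CubeProgrammer installation.
STM32_Programmer_CLI -c port=SWD mode=HOTPLUG \
  -el MX66UW1G45G_STM32N6570-DK.stldr \
  -w network_data.hex
```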
Building the Demo: Fruit Detection with YOLO-Pro

The dataset for this project covers two fruit classes: orange and avocado. The collection strategy matters more than the class count. Images captured under controlled lighting will produce a model that fails the moment a shadow crosses the frame or a leaf partially occludes the target. To avoid that, images were gathered across varied lighting conditions including direct backlighting, at multiple distances, and with natural occlusion present. Edge Impulse’s data acquisition workflow keeps this practical.
The edge-impulse-daemon command connects the STM32N6570-DK directly to your project, letting you capture and label images from the onboard camera without leaving the platform. Images were resized to a standardized input resolution at this stage, which keeps the CPU preprocessing cost at inference time low, since every live frame must be scaled down to the model’s input size. For this project, a resolution of 160 x 160 was chosen.
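For reference, the capture workflow from the command line looks roughly like this, assuming the Edge Impulse CLI has been installed through npm and the board is running Edge Impulse’s stock data-acquisition firmware; these are standard CLI commands, though your login and project prompts will differ.

```
# One-time install of the Edge Impulse CLI (requires Node.js).
npm install -g edge-impulse-cli

# Connect the STM32N6570-DK to your Edge Impulse project. The daemon
# prompts for login and project selection; frames captured from the
# onboard camera then land under Data acquisition in the Studio.
edge-impulse-daemon

# If the board was previously bound to a different project, wipe the
# stored configuration and reconnect.
edge-impulse-daemon --clean
```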
With the dataset labeled and uploaded, training in Edge Impulse Studio comes down to selecting the Object Detection learning block and choosing YOLO-Pro Pico or Nano as the model architecture (for this project, we chose the Pico variant). Before running a full training job, the target profiler is worth checking first. Select the STM32N6570-DK as the target device and the profiler returns estimated RAM usage, flash consumption, and inference latency against the actual hardware. Catching a memory budget problem at this stage costs nothing; catching it after flashing costs you a debugging session.
The decision between Pico and Nano is straightforward: Pico is faster with a smaller footprint, Nano trades a modest memory increase for better accuracy. As a rule of thumb, start with Pico and only move up if detection confidence is insufficient for your project.
Deployment: Three Binaries, Three Addresses, One Order and Common Pitfalls

Edge Impulse’s STM32N6 export gives you three files: the First Stage Bootloader (ai_fsbl_cut_2_0.hex), the application firmware (firmware-st-stm32n6.bin), and the relocatable model weights (network_data.hex). Flashing them requires STM32CubeProgrammer, though Edge Impulse also includes platform scripts (flash.sh on Linux/macOS, flash.bat on Windows) that wrap the programmer and handle the commands for you; a sketch of the equivalent calls follows below. Before running either, flip the BOOT1 switch to the right to enter DEV mode and reset the board with NRST. The board won’t accept a flash write otherwise. Flash order and addresses are fixed: FSBL at 0x70000000, application at 0x70100000, weights at 0x71000000. The bootloader only needs to be written once; on model updates, you reflash only the weights and firmware, which is the payoff of the relocatable architecture.
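If you prefer to see what the wrapper scripts are doing rather than run them blind, the sequence is roughly the following. This is a sketch of the equivalent STM32CubeProgrammer CLI calls, not a copy of flash.sh, and the external-loader filename is an assumption to verify against your install.

```
# Board must be in DEV mode: BOOT1 to the right, then press NRST.
# Order matters: FSBL first, then application, then weights.

# 1) First Stage Bootloader (the hex encodes its 0x70000000 address)
STM32_Programmer_CLI -c port=SWD mode=HOTPLUG \
  -el MX66UW1G45G_STM32N6570-DK.stldr -w ai_fsbl_cut_2_0.hex

# 2) Application firmware, placed explicitly at its fixed address
STM32_Programmer_CLI -c port=SWD mode=HOTPLUG \
  -el MX66UW1G45G_STM32N6570-DK.stldr \
  -w firmware-st-stm32n6.bin 0x70100000

# 3) Relocatable model weights (the hex encodes 0x71000000)
STM32_Programmer_CLI -c port=SWD mode=HOTPLUG \
  -el MX66UW1G45G_STM32N6570-DK.stldr -w network_data.hex

# Done: flip BOOT1 back to the left and reset to boot normally.
```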
The address assignments are not flexible: network_data.hex must land at exactly 0x71000000. If arm-none-eabi-objcopy isn’t used to enforce that address, the application hard faults at inference time with nothing in the error output pointing back to the real cause. Once all three binaries are written, flip BOOT1 back to the left, reset, and the board boots from external flash into the full inference pipeline.
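If you ever regenerate the weights image yourself, say from a raw binary, the address enforcement looks roughly like this; network_data.bin is a hypothetical input filename, but the flags are standard GNU objcopy options.

```
# Convert raw weights to Intel HEX, baking the 0x71000000 base
# address into the records so the flasher places them correctly.
arm-none-eabi-objcopy -I binary -O ihex \
  --change-addresses 0x71000000 \
  network_data.bin network_data.hex
```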
For high-resolution images, there is an invisible CPU tax worth knowing about before your first full run: the camera feed comes in at a resolution higher than the model’s input dimensions, and resizing those frames happens on the Cortex-M55 before anything reaches the NPU. If frames arrive at, say, 640 x 480 and the model expects 160 x 160, every single frame forces the CPU to read and scale roughly 300,000 source pixels before inference even begins. At continuous inference rates, that preprocessing cost accumulates. If your measured FPS is lower than the profiler’s inference latency alone would suggest, this is the first place to look.
One final issue that will have you chasing ghost bugs through clean code: running the camera, Cortex-M55, and Neural-ART Accelerator simultaneously draws more current than a standard USB data port reliably delivers. The symptoms are brownouts, random resets, and erratic camera behavior, all of which look exactly like software problems. The fix is straightforward: connect CN8 for supplemental power and move jumper JP2 to 3-4:5V_USB_STLK. Do this before your first full inference run rather than after an hour of debugging a binary that was never broken.
Closing: What This Board Unlocks

The STM32N6 has a learning curve, but once the pipeline clicks, the board delivers something genuinely difficult to find elsewhere: stable real-time vision inference on an MCU, without a Linux kernel, without a significant power budget, and without coupling your model weights to your firmware every time you want to retrain.
The pattern established here, Edge Impulse for the MLOps pipeline, ST Neural-ART relocatable mode for deployment flexibility, YOLO-Pro Pico/Nano for the model architecture, transfers directly to use cases beyond fruit detection. Always-on cameras in industrial inspection, agricultural monitoring nodes, wearables that need local vision without a cloud dependency: all of them share the same core constraints this project was built around.
If you want to skip the setup pain and start from a working baseline, the public Edge Impulse project and accompanying GitHub repository are linked below. Follow the README.md, clone the dataset, reflash the weights, and the firmware stays exactly where it is.
Public Edge Impulse Project: Link
GitHub Repository: Link