Nordic Semiconductor released the nRF54LM20B with a dedicated "Axon NPU," claiming massive speedups.
I wanted to find the physical limit of the silicon.
How many MACs (Multiply-Accumulates) can this thing actually do in a single clock cycle?
2. Understanding the Axon NPU

2.1 Why the Axon NPU Uses INT8

The Axon NPU is not a floating-point processor. It is optimized for 8-bit integer (INT8) operations.
To understand NPUs, you have to understand the physical cost of math on silicon. Neural networks are essentially giant grids of multiplications and additions (Multiply-Accumulate, or MAC operations).
- Floating Point 32 (FP32): This is what models are trained on using massive data center GPUs. It is incredibly precise, but a hardware FP32 multiplier takes up a huge amount of physical silicon space and consumes a massive amount of electricity.
- Integer 8 (INT8): An INT8 multiplier is tiny. You can pack dozens of them into the space of a single FP32 multiplier. It also consumes a fraction of the power.
- The Memory Bottleneck: The biggest drain on a battery isn't doing the math; it's moving the data from the RAM to the processor. An INT8 weight is 1 byte. An FP32 weight is 4 bytes. By using INT8, you instantly quadruple your memory bandwidth and cut memory power consumption by 75%.
Therefore, the process of Quantization (squishing an FP32 model down to INT8) is the standard industry practice for running AI on any consumer device.
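To make the trade-off concrete, here is a minimal NumPy sketch of symmetric linear quantization (this is the basic idea, not Nordic's actual tooling): FP32 weights are mapped onto INT8 with a single scale factor, shrinking storage 4x at the cost of a bounded rounding error.

```python
import numpy as np

# Hypothetical FP32 weights, e.g. one row of a Dense layer.
w_fp32 = np.random.uniform(-1.0, 1.0, size=256).astype(np.float32)

# Symmetric linear quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision we gave up.
w_restored = w_int8.astype(np.float32) * scale
max_err = np.abs(w_fp32 - w_restored).max()

print(f"FP32 size: {w_fp32.nbytes} bytes, INT8 size: {w_int8.nbytes} bytes")
print(f"Worst-case rounding error: {max_err:.4f} (at most half of scale={scale:.4f})")
```

The error is bounded by half the scale step, which is why "close enough" tasks tolerate INT8 so well.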
The NPU itself is essentially a massive array of MAC (Multiply-Accumulate) units. In one clock cycle, it can perform dozens of multiplications and additions that would take a standard Cortex-M33 dozens of cycles to do sequentially.
To use it efficiently, engineers must quantize AI models.
Think of tasks where "close enough" is fine: you don't need high-precision floating point to detect a cough, a gesture, or a vibration pattern. In embedded and energy-efficient edge AI, engineers decided to trade a tiny bit of mathematical precision for a 10x-15x energy saving.
Because Nordic chips run on coin-cell batteries, the Axon NPU has zero hardware support for Floating Point math. Silicon space is too precious.
If you try to run a model with floating-point layers on the nRF54LM20B, the NPU simply refuses and passes the math back to the standard Arm Cortex-M33 CPU, which destroys power efficiency and speed.
2.3 Wake-on-AI Architecture

In a normal MCU, the CPU and other bus masters often fight over the same bus to reach the RAM. This creates "bottlenecks."
The Axon NPU has its own local memory/cache and a dedicated path to the system RAM. It uses a specialized DMA (Direct Memory Access) controller to pull data (like audio samples or IMU readings) without asking the CPU for help.
The NPU has "hardwired" logic for specific neural network layers. If your model uses these, it flies. If it uses others, it falls back to the slow CPU.
Supported Layers and CPU Fallback
The most important feature here is the Hardware Trigger.
By doing this you can set up a chain:
Sensor -> DMA -> Axon NPU -> Trigger

The Cortex-M33 CPU can be in Deep Sleep (consuming almost zero power) while the NPU is "watching." Only when the NPU's inference result exceeds a confidence threshold (e.g., "I am 90% sure that was a dog barking") does it "wiggle a pin" to wake up the CPU.
This is "Wake-on-AI."
2.4 Typical Use Cases for the Axon NPU

It was invented for AI workloads that must run on battery power for five years.
A forest fire sensor that listens for the specific sound of a chainsaw or a tree falling.
A leak detector that "listens" to the ultrasonic hiss of a high-pressure pipe leak.
2.5 TFLM Integration and the Compiler Pipeline

The Axon NPU doesn't run C code directly; it runs a compiled "graph."
Nordic provides a compiler that takes a .tflite file and turns it into an Instruction Stream for the NPU.
Because it uses the TFLM (TensorFlow Lite for Microcontrollers) ecosystem, you can use standard tools such as Edge Impulse, TensorFlow, and Keras.
3. Theoretical Performance and Architecture Background

To benchmark the Axon NPU on the nRF54LM20B, I needed to look at both the theoretical "peak" numbers and the "effective" performance delivered by the compiler.
Because the Axon NPU is a proprietary architecture (acquired from Atlazo), Nordic expresses its performance primarily through relative speedups (e.g., "15x faster than CPU").
However, based on technical specifications and architecture analysis, this tutorial shows how I estimated and measured these values.
3.1 Axon NPU Technical Data

- Frequency: 128 MHz (0.128 GHz).
- MACs/cycle: While Nordic has not officially published the raw gate-level MAC count in the public datasheet, architecture analysis of the Atlazo Axon lineage and Nordic’s "15x speedup" claim suggests the core is optimized for parallel INT8 processing.
Usually, the Giga Operations Per Second (GOPS) value is calculated as:

GOPS = 2 × MACs per cycle × Clock Frequency (GHz)

Note: I multiply by 2 because 1 MAC (Multiply-Accumulate) counts as 2 operations (1 multiply + 1 addition).
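Plugging a few candidate MAC-array sizes into that formula at the Axon's 128 MHz clock (the array sizes are my guesses, not published figures):

```python
CLOCK_GHZ = 0.128  # 128 MHz system clock

def gops(macs_per_cycle, clock_ghz=CLOCK_GHZ):
    # 1 MAC = 2 operations (1 multiply + 1 add)
    return 2 * macs_per_cycle * clock_ghz

for macs in (32, 64, 128):
    print(f"{macs} MACs/cycle -> {gops(macs):.3f} GOPS")
```

So a 32-MAC array would peak at about 8.2 GOPS and a 64-MAC array at about 16.4 GOPS, which matches the 8-16 GOPS range estimated below.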
3.3 Atlazo Axon RDLA Architecture

The Axon NPU is based on a Reconfigurable Deep Learning Accelerator (RDLA) architecture. Unlike a generic DSP, it is a Domain-Specific Architecture (DSA).
It does not use standard RISC-V or ARM instructions. It uses a custom Instruction Stream generated by the Axon Compiler.
It is a "Stream Processor." It minimizes data movement by keeping weights and intermediate "feature maps" in local high-speed buffers (SRAM) rather than constantly hitting the system RAM.
The original Atlazo architecture (specifically the AZ-N1 lineage) featured a scalable MAC array. For the nRF54L series, Nordic has tuned this for the "Sweet Spot" of ultra-low power:
- Precision: Optimized for INT8. It supports "sub-byte" quantization (INT4/INT1) in the silicon, though the current Nordic toolchain focuses on INT8 for stability.
- Parallelism: Based on Atlazo’s design specs, the architecture is capable of performing 64 to 128 MACs per cycle depending on the configuration.
Based on the Atlazo IP heritage:
- MACs / Cycle & GOPS: The Atlazo Axon Reconfigurable Deep Learning Accelerator (RDLA) was originally designed around a highly parallel MAC array optimized for sub-milliwatt power. While Nordic hasn't published the exact MAC array size for the nRF54L20’s specific silicon implementation, we will determine the actual specs by the end of this tutorial; typically, NPUs in this power class (drawing ~3mA active) utilize a 32 or 64 MACs/cycle array. At the system clock speed of 128 MHz, this yields a theoretical peak of roughly 8 to 16 GOPS (Giga Operations Per Second)[Prnnewswire.com Nordic News].
- Native Hardware Acceleration: The compiler maps specific mathematical operations directly to the Axon silicon. If your model uses these, it achieves the 15x speedup.
They include:
- Conv1D and Conv2D (with specific acceleration for pointwise)[Nordic Academy nRF54L]
- Depthwise Convolution (without channel multipliers)
- Fully Connected (Dense) layers
- Pooling (Max / Average)
- Native hardware activation functions: ReLU, ReLU6, and LeakyReLU.
- CPU Fallback: Operations like Softmax, Tanh, or Sigmoid are not natively accelerated by the Axon silicon and are transparently handed back to the Cortex-M33 CPU by the Axon compiler. If your model relies heavily on these, your benchmarked GOPS will drop significantly.
- Data Types: The NPU is strictly optimized for INT8 quantized models (with options for INT32 model output).
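A toy sketch of that partitioning decision. The op names follow TFLite's builtin operator naming (Sigmoid is LOGISTIC in TFLite), and the accelerated set is my reading of the layer list above, not an official Nordic table:

```python
# My reading of the accelerated layer list above; not an official Nordic table.
AXON_ACCELERATED = {
    "CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED",
    "MAX_POOL_2D", "AVERAGE_POOL_2D",
    "RELU", "RELU6", "LEAKY_RELU",
}

def partition(model_ops):
    """Split a model's op list into NPU-accelerated and CPU-fallback groups."""
    npu = [op for op in model_ops if op in AXON_ACCELERATED]
    cpu = [op for op in model_ops if op not in AXON_ACCELERATED]
    return npu, cpu

npu_ops, cpu_ops = partition(["CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"])
print("NPU:", npu_ops)  # ['CONV_2D', 'RELU', 'FULLY_CONNECTED']
print("CPU:", cpu_ops)  # ['SOFTMAX']
```

The more ops land in the second list, the further your measured GOPS will fall from the theoretical peak.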
Official Performance Metrics for nRF54LM20B:
- Speedup vs. CPU: The Axon NPU executes AI inference up to 15x faster than running the exact same model on the integrated 128 MHz Arm Cortex-M33.
- Energy Efficiency vs. CPU: Running models on the Axon NPU is up to 10x more energy-efficient than the Cortex-M33. I can't prove that because I don't have Nordic's Power Profiler Kit II.
- Competitive Benchmark: Nordic states the Axon NPU delivers up to 7x higher performance and 8x better energy efficiency compared to "the closest competing edge AI solution" operating in similar power budgets.
I can't provide a definitive comparison or proof yet; the only thing we will determine today is how the NPU's speed compares to the CPU's.
3.5 Nordic-Specific Tools

Nordic provides specific tools to extract the exact MAC utilization and efficiency for custom models, so I do not need to calculate GOPS manually:
- Nordic Edge AI Lab (Released Jan 2026): This web-based tool allows you to build models (or use their Text-to-Wake-Word generator) and compile them directly for the Axon NPU. It provides estimated latency and footprint benchmarks before you flash the device.
- Edge AI Add-on for nRF Connect SDK (v2.0+): This includes the Axon Compiler. When you pass a .tflite model through it, it generates a profiling report showing exactly which layers were hardware-accelerated and the precise execution time in milliseconds. I will use this add-on for benchmarking and working with the Axon NPU.
- PC Simulator: Included in the SDK, this allows you to get performance and latency estimates for the NPU without needing the physical nRF54LM20B development kit. I will use this PC simulator to learn more, as it seems like a very interesting tool.
Whether I'm looking at a tiny 9-milliwatt Nordic Axon NPU or a 30-watt Intel Core NPU inside a flagship laptop, they all share the exact same optimization pattern: relying on INT8 (8-bit integer) math for AI inference.
4. Building the "MAC-Smasher" Benchmark Model

Note: I'm using macOS Tahoe 26.4.
You will need a standard Python environment with TensorFlow installed.
4.1 Setup Python Environment

Download Miniforge for macOS:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh"

Then install it:

bash Miniforge3-MacOSX-arm64.sh

Follow the prompts and type "yes" to initialize it, then close and reopen your terminal.

Create an environment with a perfectly stable, slightly older Python (like 3.11):

conda create -n npu_env python=3.11 -y

Now activate it:

conda activate npu_env

Now install standard TensorFlow:

pip install tensorflow numpy

4.2 Generate the Strict INT8 TFLite Model

To test the NPU's absolute limit, I needed a pure math workload. I decided on a 256x256 Fully Connected (Dense) layer.
It requires exactly 65,536 MAC operations and produces exactly 64 KB of weights, small enough to fit entirely in the nRF54LM20B's ultra-fast internal SRAM without being bottlenecked by the main RAM.
For this purpose I wrote a short Python script called generate_benchmark_model.py:

mkdir <your-folder-with-diy-projects>/npu-model-gen-script
cd npu-model-gen-script
touch generate_benchmark_model.py

It uses TensorFlow/Keras to generate a strict INT8 quantized model, axon_mac_smasher_int8.tflite. Paste the content below into the generate_benchmark_model.py file.
import tensorflow as tf
import numpy as np
# 1. Define the exact math we want to perform
# Input: Array of 256 numbers.
# Dense Layer: 256 neurons.
# Total MACs = 256 (inputs) * 256 (neurons) = 65,536 MAC operations.
inputs = tf.keras.Input(shape=(256,))
# We set use_bias=False to keep the math purely Multiplication and Addition (MAC)
# We use 'relu' because Axon natively accelerates it in hardware
x = tf.keras.layers.Dense(256, use_bias=False, activation='relu')(inputs)
model = tf.keras.Model(inputs=inputs, outputs=x)
# 2. Create a "Representative Dataset"
# This is MANDATORY for INT8 quantization. TensorFlow needs dummy data
# to figure out how to scale the floating-point numbers into 8-bit integers.
def representative_data_gen():
    for _ in range(100):
        # Generate random float32 data between -1.0 and 1.0
        yield [np.random.uniform(-1.0, 1.0, size=(1, 256)).astype(np.float32)]
# 3. Configure the TFLite Converter for STRICT INT8
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force the math to be INT8 only (No float fallbacks!)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Force the Input and Output pins to be INT8
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# 4. Convert and Save
tflite_model = converter.convert()
with open('axon_mac_smasher_int8.tflite', 'wb') as f:
    f.write(tflite_model)
print("Model successfully generated: axon_mac_smasher_int8.tflite")
print("Total Expected MACs: 65,536")

The NPU only speaks 8-bit integer math, so forcing INT8 is crucial to prevent the NPU from rejecting the model and handing it back to the slower CPU.
By doing this, I control the exact mathematical size of the model, and force strict INT8 quantization so the Axon NPU accepts 100% of the operations without falling back to the CPU.
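What does "strict INT8" mean numerically? Below is a NumPy sketch of the quantized Dense layer the NPU actually executes: INT8 inputs times INT8 weights, accumulated in INT32, then requantized back to INT8. The output scale here is invented for illustration; the real one comes from the representative dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantized Dense layer: INT8 activations times INT8 weights.
x_q = rng.integers(-128, 128, size=(1, 256), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)

# The 65,536 multiplies accumulate in INT32, so they cannot overflow:
# |acc| <= 256 * 128 * 128 = 4,194,304, far below the INT32 limit.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)

# Requantize the INT32 accumulator back to INT8 with an output scale
# (1/1024 is a made-up scale; real models derive it during quantization).
y_q = np.clip(np.round(acc * (1.0 / 1024.0)), -128, 127).astype(np.int8)

print("accumulator dtype:", acc.dtype, "output dtype:", y_q.dtype)
```

This is why the benchmark is pure integer work end to end: no float math survives conversion.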
4.3 Run the Model Generation Script

Run the script:

(npu_env)% python3 generate_benchmark_model.py

It is not broken, just wait patiently. You will get output something like this:
Saved artifact at '/var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la'. The following endpoints are available:
Endpoint 'serve'
args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 256), dtype=tf.float32, name='keras_tensor')
Output Type:
TensorSpec(shape=(None, 256), dtype=tf.float32, name=None)
Captures:
4817651088: TensorSpec(shape=(), dtype=tf.resource, name=None)
/Users/maxxlife/miniforge3/envs/npu_env/lib/python3.11/site-packages/tensorflow/lite/python/convert.py:846: UserWarning: Statistics for quantized inputs were expected, but not specified; continuing anyway.
warnings.warn(
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1774728156.181161 453968 tf_tfl_flatbuffer_helpers.cc:364] Ignored output_format.
W0000 00:00:1774728156.181173 453968 tf_tfl_flatbuffer_helpers.cc:367] Ignored drop_control_dependency.
I0000 00:00:1774728156.181417 453968 reader.cc:83] Reading SavedModel from: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.181547 453968 reader.cc:52] Reading meta graph with tags { serve }
I0000 00:00:1774728156.181551 453968 reader.cc:147] Reading SavedModel debug info (if present) from: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.182224 453968 mlir_graph_optimization_pass.cc:437] MLIR V1 optimization pass is not enabled
I0000 00:00:1774728156.182334 453968 loader.cc:236] Restoring SavedModel bundle.
I0000 00:00:1774728156.187968 453968 loader.cc:220] Running initialization op on SavedModel bundle at path: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.189636 453968 loader.cc:471] SavedModel load for tags { serve }; Status: success: OK. Took 8222 microseconds.
I0000 00:00:1774728156.195318 453968 dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
fully_quantize: 0, inference_type: 6, input_inference_type: INT8, output_inference_type: INT8
W0000 00:00:1774728158.375938 453968 flatbuffer_export.cc:3851] Skipping runtime version metadata in the model. This will be generated by the exporter.
Model successfully generated: axon_mac_smasher_int8.tflite
Total Expected MACs: 65,536
The most important part is at the end:
Model successfully generated: axon_mac_smasher_int8.tflite
Total Expected MACs: 65,536

Notice this exact line in your output:

input_inference_type: INT8, output_inference_type: INT8

This confirms the model is strictly locked to 8-bit integer math for the Axon NPU.
Right now, that .tflite file is sitting on my Mac's hard drive. To get it onto the nRF54LM20B using Zephyr, I need to convert it into a C-array so the Axon compiler can bake it into the chip's Flash memory.
It is worth explaining a couple of important details from the Python code. I specifically chose 256 for both the inputs and the neurons (resulting in a 256x256 weight matrix) because of four strict hardware rules that matter when benchmarking an NPU.
# Total MACs = 256 (inputs) * 256 (neurons) = 65,536 MAC operations.

Here is why 256 is the "magic number" for this exact test:
1. It creates exactly 64 Kilobytes of Weights (The SRAM Sweet Spot)
In an INT8 model, every single weight (connection between neurons) takes exactly 1 byte of memory.
The nRF54LM20B has 512 KB of total RAM, but the Axon NPU has its own internal, ultra-fast local cache (Tightly Coupled Memory) so it doesn't have to wait for the main system bus. 64 KB is small enough to fit entirely inside this ultra-fast memory.
- If I chose 1024x1024 (1 Megabyte): It wouldn't fit in the chip.
- If I chose 512x512 (256 KB): It might force the NPU to fetch data from the slower main RAM, meaning you are benchmarking the memory speed, not the math speed. 64 KB guarantees we are purely testing the silicon math engine.
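The weight-size arithmetic for those three candidates, as a quick sanity check (nothing device-specific here, just INT8 weights at 1 byte each):

```python
def dense_weight_bytes(n_in, n_out):
    # INT8: every weight is exactly 1 byte
    return n_in * n_out

for n in (256, 512, 1024):
    kb = dense_weight_bytes(n, n) / 1024
    print(f"{n}x{n} Dense layer, INT8 weights: {kb:.0f} KB")
```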
2. Perfect Hardware Alignment (Powers of 2)
NPUs are designed with physical MAC arrays sized in powers of 2 (usually 16, 32, or 64 MACs working in parallel). If I give an NPU an odd size, like an input of 250, the hardware still has to load 256 bytes into its 32-byte memory lanes, filling the empty spaces with zeros (padding). This wastes clock cycles and skews the benchmark math.
Because 256 is perfectly divisible by 16, 32, 64, and 128, the NPU hardware operates at absolute 100% efficiency with zero wasted cycles.
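A quick sketch of that padding penalty, assuming hypothetical 32-byte memory lanes (the Axon's real lane width is not published):

```python
import math

def bytes_loaded(n_inputs, lane_width=32):
    # The hardware reads whole lanes, so odd sizes get rounded up with zero padding.
    return math.ceil(n_inputs / lane_width) * lane_width

for n in (250, 256):
    loaded = bytes_loaded(n)
    print(f"{n} inputs -> {loaded} bytes loaded, {loaded - n} wasted on padding")
```

At 250 inputs, 6 of the loaded bytes are pure padding; at 256, none are.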
3. Hiding the "Wake-Up" Overhead
When Zephyr RTOS tells the NPU to start, there is a slight delay (setup time, DMA configuration, triggering the hardware interrupt). This overhead might take a few dozen clock cycles.
- If I chose 16x16 (256 MACs): The NPU would finish the math so fast that my code's Zephyr timer would mostly just be measuring the OS overhead, giving me an artificially slow benchmark.
- At 256x256 (65,536 MACs): The math takes long enough that the tiny OS overhead becomes a statistical rounding error, giving me the true hardware speed.
4. Math
When I run the Zephyr benchmark and look at the total clock cycles, 65,536 is incredibly easy to divide in your head to find the physical MAC array size:

- If it takes ~2,000 cycles: 65,536 / ~2,000 points to a 32 MACs/cycle array.
- If it takes ~1,000 cycles: 65,536 / ~1,000 points to a 64 MACs/cycle array.
So, 256 is carefully chosen to perfectly flood the Axon NPU's data pipes. If any Axon engineers read this, I hope they will comment and confirm (or correct) my assumptions.
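That mental math as a one-liner; it deliberately ignores the fixed wake-up overhead, which the 65,536-MAC workload is sized to dwarf:

```python
TOTAL_MACS = 65_536

def implied_macs_per_cycle(measured_cycles):
    # Peak MACs/cycle implied by a measured cycle count, ignoring setup overhead.
    return TOTAL_MACS / measured_cycles

for cycles in (2048, 1024, 512):
    print(f"{cycles} cycles -> {implied_macs_per_cycle(cycles):.0f} MACs/cycle")
```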
Later we will run this .tflite model through the Nordic Axon compiler.
Connect the development kit to the Debugger USB port and move the power switch to the ON position.
Download and install it from here:
Open Board Configurator.
Select device.
Select connected board.
Now, you can review the entire board configuration.
Here is my board version.
Now, close the Board Configurator and return to the Apps window. Open Toolchain Manager.
Here is the list of available toolchains. You will need to install nRF Connect SDK v3.3.0-preview3 using the CLI. Clicking the Install with CLI button will redirect you to the nRF documentation; however, since their documentation can be a nightmare, I have provided a shortcut below in 5.3.
Follow these commands to avoid digging through their pages.
5.3 Verify the nrfutil Installation

Enter the following command:

nrfutil list

Review the output:
Command Version Description
completion 1.5.0
sdk-manager 1.11.0 Install and manage SDK and toolchain installations for the nRF Connect SDK.
Found 2 installed command(s)

If you don't have nrfutil, download and install the appropriate version by following this link: https://www.nordicsemi.com/Products/Development-tools/nRF-Util/Download?lang=en#infotabs
Enter the following command:
nrfutil install sdk-manager

Review the output:
nrfutil-sdk-manager already installed - use '--force' to uninstall and reinstall the command
[00:00:00] ###### 100% [Install packages] Install packages

5.5 Install the Required nRF Connect SDK Preview

Enter the following command:
nrfutil sdk-manager install v3.3.0-preview3

Review the output:
Using the global server for downloads. Use '--region cn' for the server in Mainland China.
[00:00:02] ------ 0% [Download toolchain v3.3.0-preview3] Downloading toolchain

SDK installation may take some time, so please be patient and carefully review the output for any errors. It might take up to an hour to finish.
% nrfutil sdk-manager install v3.3.0-preview3
Using the global server for downloads. Use '--region cn' for the server in Mainland China.
[00:16:01] ###### 100% [Download toolchain v3.3.0-preview3] Toolchain downloaded
[00:00:18] ###### 100% [Unpack toolchain v3.3.0-preview3] Toolchain unpacked to /opt/nordic/ncs/tmp/.tmpMLCL43
[00:00:01] ###### 100% [Install toolchain v3.3.0-preview3] Toolchain installed at /opt/nordic/ncs/toolchains/e023efa4ae
[00:55:30] ###### 100% [Download SDK v3.3.0-preview3] Download SDK v3.3.0-preview3
[00:00:06] ###### 100% [Calculating SDK checksum] Verified download
[00:00:33] ###### 100% [Unpack SDK v3.3.0-preview3] Unpacked SDK tarball

5.6 Verify the SDK Installation

Once complete, the latest toolchain will be installed on your macOS at:

/opt/nordic/ncs/toolchains/<toolchain-hash-number-version>

The latest SDK will be installed at:

/opt/nordic/ncs/<sdk-number-version>

To perform a final check and verify your successful installations, run the following command:

% nrfutil sdk-manager list

Review the output:
SDK Type SDK Version SDK Status Toolchain Status
nrf v3.3.0-preview3 Installed Installed

6. Compiling the Model for the Axon NPU

The Axon NPU cannot read a standard TensorFlow .tflite array; it only speaks its own proprietary "hardware machine code." To bridge that gap, Nordic engineers created the Edge AI Add-on.
Nordic built the compiler into the Zephyr build system. When you hit "Build", Zephyr intercepts your .tflite file, translates the TensorFlow math into Axon hardware instructions, and generates its own special .h file containing NPU-specific memory pointers (not just raw bytes).
You have your axon_mac_smasher_int8.tflite file. Instead of using xxd to turn it into a flat C-array, you must run it through the Axon NPU Compiler. This compiler is included in the Edge AI Add-on you just installed.
6.1 Create an Edge AI Workspace

Run the following commands to set up the folder:

cd <your-folder-with-diy-projects>
mkdir axon-edge-ai
cd axon-edge-ai

Launch the installed toolchain:

nrfutil sdk-manager toolchain launch --ncs-version v3.3.0-preview3 --shell

Review the output:
Initializing shell environment!
(v3.3.0-preview3) macbook-maxxlife%

6.2 Set ZEPHYR_BASE

Export the correct environment variable ZEPHYR_BASE:

export ZEPHYR_BASE=<your-folder-with-diy-projects>/axon-edge-ai

For example, I have:

export ZEPHYR_BASE=/Users/maxxlife/github_repos/axon-edge-ai

6.3 Initialize the Workspace with west

Now initialize the Zephyr repo with west:

west init -m https://github.com/nrfconnect/sdk-edge-ai

Update all repositories by running the following command:

west update

Review the output:
=== updating zephyr (zephyr):
--- zephyr: initializing
Initialized empty Git repository in /Users/maxxlife/github_repos/axon-edge-ai/zephyr/.git/
--- zephyr: fetching, need revision 4b6df5ff11b1a10a2ffa89dab7450c0af98c9e3a
remote: Enumerating objects: 1510594, done.
remote: Counting objects: 100% (411/411), done.
remote: Compressing objects: 100% (245/245), done.
Receiving objects: 9% (135954/1510594), 40.94 MiB | 4.59 MiB/s

Now West will begin the workspace initialization. This process may take some time, typically around 15-20 minutes. You can monitor the workspace setup progress and view specific details directly in the terminal.
Now, exit the nrfutil shell by simply typing exit:

(v3.3.0-preview3) macbook-maxxlife% exit

to get back to the Conda environment:

(npu_env)

6.4 Activate the Conda Virtual Environment

Since you are using the nRF Connect SDK, you have the translator built in. Make sure that you are still in the activated Conda virtual environment; if not, run:

conda activate npu_env

6.5 Install the Compiler Requirements

After west update, check the files in the folder using ls:
(v3.3.0-preview3) macbook-maxxlife% ls
bootloader edge-ai modules nrf nrfxlib test tools zephyr

Change into the compiler directory:

cd edge-ai/tools/axon/compiler

Now that your Conda environment is active, install the extra libraries Nordic's compiler script needs:

pip install -r scripts/requirements.txt

Make sure you are still in your terminal inside the edge-ai/tools/axon/compiler directory:

cd <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/

We are going to create a brand new, minimal file called config.yaml.
Run this Python command in your terminal to generate the config.yaml file with the exact configuration:
Note: Ensure that you replace the tflite_model path with the correct one for your local system!
python -c '
import yaml
data = {
"mac_smasher_benchmark": {
"model_name": "axon_mac_smasher_int8",
"tflite_model": "/Users/maxxlife/github_repos/npu-model-gen-script/axon_mac_smasher_int8.tflite",
"interlayer_buffer_size": 70000,
"log_level": "info"
}
}
with open("config.yaml", "w") as f:
yaml.dump(data, f)
print("config.yaml generated perfectly!")
'

6.6 Run the Axon Compiler

Enter the following command:

python scripts/axons_ml_nn_compiler_executor.py config.yaml

6.7 Check the Compiler Output

INFO: running mac_smasher_benchmark
INFO: starting compiler...
INFO: model name is : axon_mac_smasher_int8
INFO: test data or proper test labels are not provided, we have the tflite file.
INFO: running for variant axon_mac_smasher_int8
/Users/maxxlife/miniforge3/envs/npu_env/lib/python3.11/site-packages/cffi/cparser.py:436: UserWarning: #pragma in cdef() are entirely ignored. They should be removed for now, otherwise your code might behave differently in a future version of CFFI if #pragma support gets added. Note that '#pragma pack' needs to be replaced with the 'packed' keyword argument to cdef().
warnings.warn(
INFO: compiler intermediate outputs are saved in : /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/intermediate/
executing compiler object at /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/scripts/../bin/Darwin/libnrf-axon-nn-compiler-lib-arm64.dylib
open Compiler I/O files:
input model desc file: intermediate/nrf_axon_model_axon_mac_smasher_int8_bin_.bin
binary file version: 0x00001100
fully connected 1 x 256 into 256 x 256...done.
command_buf size=6632
model constants buffer size=66560
interlayer buffer size: provisioned 70000, needed 1280
Psum buffer size: provisioned 0, needed 0
Persistent Vars buffer size needed 0
Compiler generating model.h file
run_c_compiler_lib-INFO: model compiled successfully ....
run_c_compiler_lib-INFO: not running inference as test vectors are not provided
done executing compiler object at /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/scripts/../bin/Darwin/libnrf-axon-nn-compiler-lib-arm64.dylib
INFO: done running variant axon_mac_smasher_int8, return code 0
INFO:
Performance_Metrics
Model_Variant Inference_Time Accuracy Precision Quantization_Loss Flash_Size Ram_Size Total_Memory
axon_mac_smasher_int8 None None None None 0 0 0
INFO: completed running mac_smasher_benchmark, return code 0(NRF_AXON_COMPILER_RESULT_SUCCESS), took 0.88 seconds

6.8 Find the Generated Header File

The .h file will be located in the outputs folder:

<your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/outputs/nrf_axon_model_axon_mac_smasher_int8_.h

Later, you will need to copy this .h file from the output directory to your project's /src folder.
Ensure that you have the nRF Connect Extension installed in VS Code from Extensions Marketplace.
Open the nRF Connect Extension and verify that your SDKs and Toolchains are correctly configured.
Click the Manage Toolchains tab.
Navigate to this section and click Validate Toolchain to ensure everything is set up correctly.
Select the specific toolchain version installed for this tutorial. Since I am working on multiple projects, I have several toolchains installed, but you should choose the one we just configured.
After selecting the toolchain to validate, the results will appear in the bottom-right corner of VS Code.
Follow the same steps to validate the SDK.
7.4 Open the hello_axon Application

First, let's build the default hello_axon application to verify that everything is configured correctly.
Select an existing application folder.

Then choose the path to the samples folder inside the workspace you created previously. In my case the path is:

/Users/maxxlife/github-repos/axon-edge-ai/edge-ai/samples/axon/hello-axon

In your case it will be something like:

<your-folder-with-diy-projects>/axon-edge-ai/edge-ai/samples/axon/hello-axon

Click Open to load the project workspace.
You will be automatically redirected to the VS Code Explorer view.
Now go back and open the nRF Connect view to build the project.
Once you see the hello_axon project added to the list, you can proceed to configure and build it.
Click Add build configuration, then ensure you select the correct SDK and Toolchain versions from the dropdown menus.
Finally, click the Generate and Build button to generate the build files and compile the application.
Click Flash to run west flash and upload program to the board.
To view the application output, you must have the serial port open.
Choose VCOM1
Select connected device.
Now, press the physical Reset button on your hardware to restart the program and view the output in the terminal.
Since the hello_axon sample is running successfully, we can use it as a foundation for our new benchmark application.
Now you need to create a Zephyr benchmark application.

8.1 Prepare the Benchmark Application

The best approach is to build your new application directly from the hello_axon sample. This ensures all environment variables and dependencies remain intact.
I recommend copying the folder hello_axon and renaming it to axon_benchmark to keep your workspace organized.
Open the CMakeLists.txt file in your new axon_benchmark folder and update the project name.
# Zephyr CMake project
find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})
project(axon_benchmark)

8.2 Zephyr project main.c benchmark code

Update the main.c file with this code.
#include <drivers/axon/nrf_axon_driver.h>
#include <drivers/axon/nrf_axon_nn_infer.h>
#include <axon/nrf_axon_platform.h>
#include <zephyr/logging/log.h>
#include <zephyr/timing/timing.h>
// Include the generated file (with the trailing underscore!)
#include "nrf_axon_model_axon_mac_smasher_int8_.h"
LOG_MODULE_REGISTER(axon_benchmark);
int main(void)
{
    nrf_axon_result_e result;
    LOG_INF("--- Axon NPU MAC Benchmark ---");
    // 1. Initialize the NPU platform
    if (nrf_axon_platform_init() != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Platform init failed");
        return -1;
    }
    // 2. Initialize model variables
    if (nrf_axon_nn_model_init_vars(&model_axon_mac_smasher_int8) != 0) {
        LOG_ERR("Model var init failed");
        return -1;
    }
    // 3. Validate the model
    if (nrf_axon_nn_model_validate(&model_axon_mac_smasher_int8) != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Model validation failed!");
        return -1;
    }
    // 4. Prepare input data (fill the 256-byte input array with dummy data)
    int8_t *input_buffer = model_axon_mac_smasher_int8.inputs[0].ptr;
    for (int i = 0; i < 256; i++) {
        input_buffer[i] = 1;
    }
    int8_t output_buffer[256];
    LOG_INF("Model Loaded. Expected MACs: 65,536");
    LOG_INF("Starting 10-pass benchmark...");
    // 5. Set up the high-resolution timer
    timing_init();
    timing_start();
    uint64_t total_cycles = 0;
    const int num_runs = 10;
    // 6. Run the benchmark
    for (int i = 0; i < num_runs; i++) {
        timing_t start_time = timing_counter_get();
        // BOOM! Fire the NPU!
        result = nrf_axon_nn_model_infer_sync(&model_axon_mac_smasher_int8, input_buffer, output_buffer);
        timing_t end_time = timing_counter_get();
        if (result != NRF_AXON_RESULT_SUCCESS) {
            LOG_ERR("Inference failed on run %d", i);
            return -1;
        }
        total_cycles += timing_cycles_get(&start_time, &end_time);
    }
    timing_stop();
    // 7. Calculate results
    uint64_t average_cycles = total_cycles / num_runs;
    // Multiply by 100 to print two decimal places cleanly
    uint64_t macs_x_100 = (65536 * 100) / average_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", macs_x_100 / 100, macs_x_100 % 100);
    LOG_INF("------------------------------");
    return 0;
}

8.3 Copy the compiled .h file

Copy the compiled nrf_axon_model_axon_mac_smasher_int8_.h file created in 6.8 to the /src folder so that main.c can find the header.
cp <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/outputs/nrf_axon_model_axon_mac_smasher_int8_.h <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/samples/axon/axon_benchmark

8.4 Update project configuration

CONFIG_NCS_SAMPLES_DEFAULTS=y
# Enable NPU
CONFIG_NRF_AXON=y
# Set buffer to what the compiler requested
CONFIG_NRF_AXON_INTERLAYER_BUFFER_SIZE=1280
CONFIG_NRF_AXON_PSUM_BUFFER_SIZE=0
# Enable Timing APIs for benchmarking
CONFIG_TIMING_FUNCTIONS=y
CONFIG_PICOLIBC_IO_FLOAT=y

8.5 Set up the build configuration

Use the same build configuration as the "Hello World" sample.
Flash the compiled application to your nRF54LM20B dev kit.
8.7 Open a serial port connection

Open the serial port again and view the application output:
*** Booting nRF Connect SDK v3.3.0-preview2-ede152ec210b ***
*** Using Zephyr OS v4.3.99-4b6df5ff11b1 ***
I: --- Axon NPU MAC Benchmark ---
I: Model Loaded. Expected MACs: 65,536
I: Starting 10-pass benchmark...
I: Average Cycles per Inference: 16110
I: Calculated Hardware MACs/Cycle: 4.06

I flashed the board, opened the serial terminal, and saw the result: 16,110 average cycles, which works out to 4.06 MACs per cycle.
My Conclusion: Why exactly ~4? I think it's because the NPU hit the Memory Wall. The internal memory bus on the Cortex-M33 architecture is 32 bits (4 bytes) wide. Since the Dense layer requires 1 byte per weight, the NPU was physically limited to pulling 4 weights out of memory per clock cycle.
But it is just my assumption, and I need Nordic engineers to help figure that out.
The Takeaway: I think the benchmark showed that the Axon NPU's Direct Memory Access (DMA) and pipelining are so well optimized that it crunches data at essentially 100% of the physical speed limit of the copper traces on the TSMC 22nm silicon.
If I ran this exact same 256x256 benchmark on the standard 128 MHz Cortex-M33 CPU using C-code, it wouldn't get 4 MACs per cycle.
- The CPU has to load the instruction, load the data pointer, fetch the data, increment the pointer, multiply, add, and store.
- A highly optimized CPU usually achieves about 0.3 to 0.5 MACs per cycle on Dense layers.
- By hitting a sustained 4.06, Axon NPU just proved Nordic's marketing claim: It is ~10x to 15x faster than the CPU!
Let's slightly update our code to see how much faster the Axon NPU handles this task compared to the Cortex CPU. Replace the existing code with the following:
#include <drivers/axon/nrf_axon_driver.h>
#include <drivers/axon/nrf_axon_nn_infer.h>
#include <axon/nrf_axon_platform.h>
#include <zephyr/logging/log.h>
#include <zephyr/timing/timing.h>
// Include the generated file (with the trailing underscore!)
#include "nrf_axon_model_axon_mac_smasher_int8_.h"
LOG_MODULE_REGISTER(axon_benchmark);
// We want O3 so the CPU runs at maximum physical speed
__attribute__((noinline, optimize("O3")))
void run_cpu_dense_layer(const int8_t *input, const int8_t *weights, int32_t *output)
{
    for (int out_idx = 0; out_idx < 256; out_idx++) {
        int32_t accumulator = 0;
        for (int in_idx = 0; in_idx < 256; in_idx++) {
            accumulator += input[in_idx] * weights[out_idx * 256 + in_idx];
        }
        output[out_idx] = accumulator;
    }
}

int main(void)
{
    nrf_axon_result_e result;
    LOG_INF("--- Axon NPU MAC Benchmark ---");
    // 1. Initialize the NPU platform
    if (nrf_axon_platform_init() != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Platform init failed");
        return -1;
    }
    // 2. Initialize model variables
    if (nrf_axon_nn_model_init_vars(&model_axon_mac_smasher_int8) != 0) {
        LOG_ERR("Model var init failed");
        return -1;
    }
    // 3. Validate the model
    if (nrf_axon_nn_model_validate(&model_axon_mac_smasher_int8) != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Model validation failed!");
        return -1;
    }
    // 4. Prepare input data (fill the 256-byte input array with dummy data)
    int8_t *input_buffer = model_axon_mac_smasher_int8.inputs[0].ptr;
    for (int i = 0; i < 256; i++) {
        input_buffer[i] = 1;
    }
    int8_t output_buffer[256];
    LOG_INF("Model Loaded. Expected MACs: 65,536");
    LOG_INF("Starting 10-pass benchmark...");
    // 5. Set up the high-resolution timer
    timing_init();
    timing_start();
    uint64_t total_cycles = 0;
    const int num_runs = 10;
    // 6. Run the benchmark
    for (int i = 0; i < num_runs; i++) {
        timing_t start_time = timing_counter_get();
        // BOOM! Fire the NPU!
        result = nrf_axon_nn_model_infer_sync(&model_axon_mac_smasher_int8, input_buffer, output_buffer);
        timing_t end_time = timing_counter_get();
        if (result != NRF_AXON_RESULT_SUCCESS) {
            LOG_ERR("Inference failed on run %d", i);
            return -1;
        }
        total_cycles += timing_cycles_get(&start_time, &end_time);
    }
    timing_stop();
    // 7. Calculate results
    uint64_t average_cycles = total_cycles / num_runs;
    // Multiply by 100 to print two decimal places cleanly
    uint64_t macs_x_100 = (65536 * 100) / average_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", macs_x_100 / 100, macs_x_100 % 100);
    LOG_INF("------------------------------");
    LOG_INF("==============================");
    LOG_INF("--- Cortex-M33 CPU Benchmark ---");
    timing_start(); // Restart the timer we stopped after the NPU run!
    static int8_t cpu_input[256];
    static int8_t cpu_weights[65536];
    static int32_t cpu_output[256];
    // Call through a volatile function pointer so the compiler cannot
    // inline or constant-fold the dense layer away
    void (*volatile blind_cpu_func)(const int8_t *, const int8_t *, int32_t *) = run_cpu_dense_layer;
    // Fill weights and inputs with unpredictable data from the cycle counter
    for (int i = 0; i < 65536; i++) {
        cpu_weights[i] = (int8_t)(k_cycle_get_32() & 0xFF);
    }
    for (int i = 0; i < 256; i++) {
        cpu_input[i] = (int8_t)(k_cycle_get_32() & 0xFF);
    }
    LOG_INF("CPU Model Loaded. Expected MACs: 65,536");
    uint64_t total_cpu_cycles = 0;
    int32_t dynamic_checksum = 0;
    for (int i = 0; i < num_runs; i++) {
        // Mutate the input so every run is unique
        cpu_input[0] = (int8_t)(k_cycle_get_32() & 0xFF);
        timing_t start_time = timing_counter_get();
        // Execute the blind jump! The compiler is helpless here.
        blind_cpu_func(cpu_input, cpu_weights, cpu_output);
        timing_t end_time = timing_counter_get();
        total_cpu_cycles += timing_cycles_get(&start_time, &end_time);
        // Consume an output so the result is observably used
        dynamic_checksum += cpu_output[i % 256];
    }
    uint64_t average_cpu_cycles = total_cpu_cycles / num_runs;
    if (average_cpu_cycles == 0) {
        average_cpu_cycles = 1;
    }
    uint64_t cpu_macs_x_100 = (65536 * 100) / average_cpu_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cpu_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", cpu_macs_x_100 / 100, cpu_macs_x_100 % 100);
    LOG_INF("------------------------------");
    uint64_t speedup_x_100 = (average_cpu_cycles * 100) / average_cycles;
    LOG_INF(">>> AXON NPU SPEEDUP: %llu.%02llux FASTER <<<", speedup_x_100 / 100, speedup_x_100 % 100);
    LOG_INF("==============================");
    LOG_INF("CPU Math Checksum: %d", dynamic_checksum);
    return 0;
}
Build and run the project again.
In the serial output, you’ll see the performance gap between the Axon NPU and the Cortex CPU while performing the same task. The difference is impressive.
9.2 Analyze the results

Your serial output should look something like this:
*** Booting nRF Connect SDK v3.3.0-preview2-ede152ec210b ***
*** Using Zephyr OS v4.3.99-4b6df5ff11b1 ***
I: --- Axon NPU MAC Benchmark ---
I: Model Loaded. Expected MACs: 65,536
I: Starting 10-pass benchmark...
I: Average Cycles per Inference: 15975
I: Calculated Hardware MACs/Cycle: 4.10
I: ------------------------------
I: ==============================
I: --- Cortex-M33 CPU Benchmark ---
I: CPU Model Loaded. Expected MACs: 65,536
I: Average Cycles per Inference: 395748
I: Calculated Hardware MACs/Cycle: 0.16
I: ------------------------------
I: >>> AXON NPU SPEEDUP: 24.77x FASTER <<<
I: ==============================
I: CPU Math Checksum: 12412

Look at that magnificent block of text. 24.77x FASTER!
I think I have extracted the raw, undeniable physical truth of the nRF54LM20B silicon.
Let's break down the numbers:
1. The Cortex-M33 Reality (0.16 MACs/Cycle)
The CPU took 395,748 cycles to do 65,536 operations.
If you divide that out (395,748 / 65,536 ≈ 6.04), the CPU took about 6 full clock cycles for each single MAC operation.
Why? I guess because for every single multiplication, the CPU had to:
- Calculate the memory address.
- Load the input byte over the bus.
- Load the weight byte over the bus.
- Multiply them.
- Add to the accumulator.
- Increment the loop counter and check if it should branch.
2. The Axon NPU Reality (4.06-4.10 MACs/Cycle)
The NPU took around 16k cycles.
I’m still not sure why there is always a slight difference in the cycle count—sometimes it's 16,110, and other times it's a bit more or less. AI suggests it's due to dynamic clock scaling, cache hits and misses, pipeline stalls, branch prediction, background interrupts, and bus contention. It’s likely right.
The NPU didn't waste a single cycle on loop counters, branches, or instruction fetches. It locked the 32-bit memory bus open and slammed 4 weights per clock cycle directly into its hardware math array, probably hitting the absolute physical "speed of light" allowed by the copper memory traces on the chip.
The Conclusion: Around 24x-25x Faster

Nordic's marketing team claims the NPU is "up to 15x faster" because they benchmark it using full, complex models (like Audio Keyword Spotting) that have some CPU fallback overhead.
But I just built a pure, unadulterated bare-metal micro-benchmark. Stripped of all OS overhead, the raw silicon of the Axon NPU is actually nearly 25 times faster than the Cortex-M33 at crunching Dense AI layers!
- The Goal: Benchmarking the new unreleased Axon NPU.
- The Discovery: Finding out Nordic deleted TFLM and forces developers to use hardware machine code.
- The Hack: Digging into the hidden Python compiler, feeding it a custom config.yaml, and extracting the raw .h file.
- The Trap: Fighting ARM TrustZone (/ns vs Secure), GCC Compiler optimizations deleting the code, and a broken stopwatch. Yes, I didn't cover this stuff; you're just seeing the final working code without the backlog of my trials and errors.
- The Climax: The side-by-side execution revealing the "Memory Wall" (4.06-4.10 limit) and the incredible 24-25x Speedup.
My Key Learnings About Benchmark Design
The "Red Line" of Benchmarking: Strict Isolation
The golden rule of benchmarking is Isolation.
You must measure only the specific hardware or software you are trying to test, and absolutely nothing else. If you do not perfectly isolate the target, you will end up measuring "noise."
In this journey, I had to systematically strip away the noise:
- Software Noise: I threw away the TensorFlow Lite C++ interpreter because it has software overhead. I went "bare-metal" to measure the raw hardware.
- Compiler Noise: Compilers (like GCC) are designed to "cheat" by skipping math if they know the answer ahead of time. I had to use unpredictable hardware clocks (entropy) and blind function pointers to isolate the silicon's actual execution from the compiler's shortcuts.
- Measurement Noise: Using an OS-level timer (like a 32kHz sleep timer) adds milliseconds of noise. I had to use the chip's internal 128 MHz cycle counter to measure the exact nanosecond the math started and stopped.
Core Architecture: The Two Limits (Compute vs. Memory)
To understand what your benchmark is telling you, you have to understand the fundamental architecture of modern computers (the Von Neumann architecture).
Every benchmark you run will hit one of two physical walls:
The "Compute Wall" (Compute-Bound)
The physical limit of how fast the math units (ALUs or MAC arrays) can crunch numbers.
- When you hit it: When the data is already sitting inside the processor's internal registers. The math engine is running at 100% capacity, and the memory bus is waiting for it to finish.
Example: Calculating the digits of Pi, or running a Convolutional layer (where a tiny 9-byte image filter is reused millions of times).
The "Memory Wall" (Memory-Bound)
The physical limit of how fast the copper wires (the memory bus) can move data from the RAM into the math units.
- When you hit it: When the processor is so fast that it finishes the math instantly and has to sit idle waiting for the next byte of data to arrive from RAM.
Example: Our 65,536-MAC Dense Layer! The NPU math array could theoretically do 32 or 64 multiplications per cycle, but the 32-bit (4-byte) memory bus could only deliver 4 weights per cycle. The NPU hit the Memory Wall, resulting in 4.10 MACs/cycle.
If you want to benchmark any chip in the future, follow my playbook:
1. Define the Workload (Apples to Apples)

You must force both systems to do the exact same amount of physical work. I forced the Axon NPU and the Cortex-M33 to both calculate exactly 65,536 Multiply-Accumulate operations.
2. Ensure Unpredictability (Defeat the Cheats)

Never benchmark with arrays full of 0s or 1s. If the input data is predictable, the CPU will cache it, or the compiler will solve it during the build process. Always feed benchmarks with random, dynamic data so the hardware is forced to sweat.
3. Align the Timers

Always ensure you are measuring with the same stopwatch. In the very beginning I timed the NPU with the 128 MHz clock and the CPU with the 32 kHz clock, which gave totally wrong results. Time the code as close to the hardware registers as possible, avoiding high-level OS functions like printf or sleep inside the timed loop.
4. Calculate the "Per-Cycle" Truth

Never just look at "Time Elapsed in milliseconds." A chip running at 1 GHz will always finish faster in milliseconds than a chip running at 100 MHz, even if its architecture is worse. By dividing the total operations by the total clock cycles, you discover the true architectural efficiency of the silicon, completely independent of its clock speed.
Nordic Semiconductor Official Sources

Press Release (Sept 2023): "Nordic Semiconductor expands AI/ML strategy by acquiring Atlazo's IP and world-class specialists" Source Link
nRF54L20 Product Brief: Source Link
Nordic DevZone - "Introduction to nRF54L Series": Source Link
Tirias Research - "AI at the Edge": Research Reference
Edge Impulse Integration: Edge Impulse Blog on nRF549