Nordic Semiconductor released the nRF54LM20B with a dedicated "Axon NPU," claiming massive speedups.
I wanted to find the physical limit of the silicon.
How many MACs (Multiply-Accumulates) can this thing actually do in a single clock cycle?
2. Understanding the Axon NPU

2.1 Why the Axon NPU Uses INT8

The Axon NPU is not a floating-point processor. It is optimized for 8-bit integer (INT8) operations.
To understand NPUs, you have to understand the physical cost of math on silicon. Neural networks are essentially giant grids of multiplications and additions (Multiply-Accumulate, or MAC operations).
- Floating Point 32 (FP32): This is what models are trained on using massive data center GPUs. It is incredibly precise, but a hardware FP32 multiplier takes up a huge amount of physical silicon space and consumes a massive amount of electricity.
- Integer 8 (INT8): An INT8 multiplier is tiny. You can pack dozens of them into the space of a single FP32 multiplier. It also consumes a fraction of the power.
- The Memory Bottleneck: The biggest drain on a battery isn't doing the math; it's moving the data from the RAM to the processor. An INT8 weight is 1 byte. An FP32 weight is 4 bytes. By using INT8, you instantly quadruple your memory bandwidth and cut memory power consumption by 75%.
Therefore, the process of Quantization (squishing an FP32 model down to INT8) is the standard industry practice for running AI on any consumer device.
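To make the trade-off concrete, here is a minimal NumPy sketch of symmetric linear quantization (this is the basic idea, not Nordic's actual tooling): FP32 weights are mapped onto INT8 with a single scale factor, shrinking storage 4x at the cost of a bounded rounding error.

```python
import numpy as np

# Hypothetical FP32 weights, e.g. one row of a Dense layer.
w_fp32 = np.random.uniform(-1.0, 1.0, size=256).astype(np.float32)

# Symmetric linear quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision we gave up.
w_restored = w_int8.astype(np.float32) * scale
max_err = np.abs(w_fp32 - w_restored).max()

print(f"FP32 size: {w_fp32.nbytes} bytes, INT8 size: {w_int8.nbytes} bytes")
print(f"Worst-case rounding error: {max_err:.4f} (at most half of scale={scale:.4f})")
```

The error is bounded by half the scale step, which is why "close enough" tasks tolerate INT8 so well.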
The NPU itself is essentially a massive array of MAC (Multiply-Accumulate) units. In one clock cycle, it can perform dozens of multiplications and additions that would take a standard Cortex-M33 dozens of cycles to do sequentially.
To use it efficiently, engineers must quantize AI models.
Think of tasks where "close enough" is fine: you don't need high-precision floating point to detect a cough, a gesture, or a vibration pattern. In embedded and energy-efficient edge AI, engineers decided to trade a tiny bit of mathematical precision for a 10x-15x energy saving.
Because Nordic chips run on coin-cell batteries, the Axon NPU has zero hardware support for Floating Point math. Silicon space is too precious.
If you try to run a model with floating-point layers on the nRF54LM20B, the NPU simply refuses and passes the math back to the standard Arm Cortex-M33 CPU, which destroys power efficiency and speed.
2.3 Wake-on-AI Architecture

In a normal MCU, the CPU and other bus masters often fight over the same bus to reach the RAM. This creates "bottlenecks."
The Axon NPU has its own local memory/cache and a dedicated path to the system RAM. It uses a specialized DMA (Direct Memory Access) controller to pull data (like audio samples or IMU readings) without asking the CPU for help.
The NPU has "hardwired" logic for specific neural network layers. If your model uses these, it flies. If it uses others, it falls back to the slow CPU.
Supported Layers and CPU Fallback
The most important feature here is the Hardware Trigger.
By doing this you can set up a chain:
Sensor -> DMA -> Axon NPU -> Trigger

The Cortex-M33 CPU can be in Deep Sleep (consuming almost zero power) while the NPU is "watching." Only when the NPU's inference result exceeds a confidence threshold (e.g., "I am 90% sure that was a dog barking") does it "wiggle a pin" to wake up the CPU.
This is "Wake-on-AI."
2.4 Typical Use Cases for the Axon NPU

It was invented for AI workloads that must run on battery power for five years.
A forest fire sensor that listens for the specific sound of a chainsaw or a tree falling.
A leak detector that "listens" to the ultrasonic hiss of a high-pressure pipe leak.
2.5 TFLM Integration and the Compiler Pipeline

The Axon NPU doesn't run C code directly; it runs a compiled "graph."
Nordic provides a compiler that takes a .tflite file and turns it into an Instruction Stream for the NPU.
Because it uses the TFLM (TensorFlow Lite for Microcontrollers) ecosystem, you can use standard tools such as Edge Impulse, TensorFlow, and Keras.
3. Theoretical Performance and Architecture Background

To benchmark the Axon NPU on the nRF54LM20B, I needed to look at both the theoretical "peak" numbers and the "effective" performance delivered by the compiler.
Because the Axon NPU is a proprietary architecture (acquired from Atlazo), Nordic expresses its performance primarily through relative speedups (e.g., "15x faster than CPU").
However, based on technical specifications and architecture analysis, this tutorial shows how I estimated and measured these values.
3.1 Axon NPU Technical Data

- Frequency: 128 MHz (0.128 GHz).
- MACs/cycle: While Nordic has not officially published the raw gate-level MAC count in the public datasheet, architecture analysis of the Atlazo Axon lineage and Nordic’s "15x speedup" claim suggests the core is optimized for parallel INT8 processing.
Usually, the Giga Operations Per Second (GOPS) value is calculated as:

GOPS = 2 × MACs per cycle × Clock Frequency (GHz)

Note: I multiply by 2 because 1 MAC (Multiply-Accumulate) counts as 2 operations (1 multiply + 1 addition).
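Plugging a few candidate MAC-array sizes into that formula at the Axon's 128 MHz clock (the array sizes are my guesses, not published figures):

```python
CLOCK_GHZ = 0.128  # 128 MHz system clock

def gops(macs_per_cycle, clock_ghz=CLOCK_GHZ):
    # 1 MAC = 2 operations (1 multiply + 1 add)
    return 2 * macs_per_cycle * clock_ghz

for macs in (32, 64, 128):
    print(f"{macs} MACs/cycle -> {gops(macs):.3f} GOPS")
```

So a 32-MAC array would peak at about 8.2 GOPS and a 64-MAC array at about 16.4 GOPS, which matches the 8-16 GOPS range estimated below.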
3.3 Atlazo Axon RDLA Architecture

The Axon NPU is based on a Reconfigurable Deep Learning Accelerator (RDLA) architecture. Unlike a generic DSP, it is a Domain-Specific Architecture (DSA).
It does not use standard RISC-V or ARM instructions. It uses a custom Instruction Stream generated by the Axon Compiler.
It is a "Stream Processor." It minimizes data movement by keeping weights and intermediate "feature maps" in local high-speed buffers (SRAM) rather than constantly hitting the system RAM.
The original Atlazo architecture (specifically the AZ-N1 lineage) featured a scalable MAC array. For the nRF54L series, Nordic has tuned this for the "Sweet Spot" of ultra-low power:
- Precision: Optimized for INT8. It supports "sub-byte" quantization (INT4/INT1) in the silicon, though the current Nordic toolchain focuses on INT8 for stability.
- Parallelism: Based on Atlazo’s design specs, the architecture is capable of performing 64 to 128 MACs per cycle depending on the configuration.
Based on the Atlazo IP heritage:
- MACs / Cycle & GOPS: The Atlazo Axon Reconfigurable Deep Learning Accelerator (RDLA) was originally designed around a highly parallel MAC array optimized for sub-milliwatt power. While Nordic hasn't published the exact MAC array size for the nRF54L20’s specific silicon implementation, we will determine the actual specs by the end of this tutorial; typically, NPUs in this power class (drawing ~3mA active) utilize a 32 or 64 MACs/cycle array. At the system clock speed of 128 MHz, this yields a theoretical peak of roughly 8 to 16 GOPS (Giga Operations Per Second)[Prnnewswire.com Nordic News].
- Native Hardware Acceleration: The compiler maps specific mathematical operations directly to the Axon silicon. If your model uses these, it achieves the 15x speedup.
They include:
- Conv1D and Conv2D (with specific acceleration for pointwise)[Nordic Academy nRF54L]
- Depthwise Convolution (without channel multipliers)
- Fully Connected (Dense) layers
- Pooling (Max / Average)
- Native hardware activation functions: ReLU, ReLU6, and LeakyReLU.
- CPU Fallback: Operations like Softmax, Tanh, or Sigmoid are not natively accelerated by the Axon silicon and are transparently handed back to the Cortex-M33 CPU by the Axon compiler. If your model relies heavily on these, your benchmarked GOPS will drop significantly.
- Data Types: The NPU is strictly optimized for INT8 quantized models (with options for INT32 model output).
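A toy sketch of that partitioning decision. The op names follow TFLite's builtin operator naming (Sigmoid is LOGISTIC in TFLite), and the accelerated set is my reading of the layer list above, not an official Nordic table:

```python
# My reading of the accelerated layer list above; not an official Nordic table.
AXON_ACCELERATED = {
    "CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED",
    "MAX_POOL_2D", "AVERAGE_POOL_2D",
    "RELU", "RELU6", "LEAKY_RELU",
}

def partition(model_ops):
    """Split a model's op list into NPU-accelerated and CPU-fallback groups."""
    npu = [op for op in model_ops if op in AXON_ACCELERATED]
    cpu = [op for op in model_ops if op not in AXON_ACCELERATED]
    return npu, cpu

npu_ops, cpu_ops = partition(["CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"])
print("NPU:", npu_ops)  # ['CONV_2D', 'RELU', 'FULLY_CONNECTED']
print("CPU:", cpu_ops)  # ['SOFTMAX']
```

The more ops land in the second list, the further your measured GOPS will fall from the theoretical peak.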
Official Performance Metrics for nRF54LM20B:
- Speedup vs. CPU: The Axon NPU executes AI inference up to 15x faster than running the exact same model on the integrated 128 MHz Arm Cortex-M33.
- Energy Efficiency vs. CPU: Running models on the Axon NPU is up to 10x more energy-efficient than the Cortex-M33. I can't prove that because I don't have Nordic's Power Profiler Kit II.
- Competitive Benchmark: Nordic states the Axon NPU delivers up to 7x higher performance and 8x better energy efficiency compared to "the closest competing edge AI solution" operating in similar power budgets.
I can't provide a definitive comparison or proof yet; the only thing we will determine today is how the NPU's speed compares to the CPU's.
3.5 Nordic-Specific Tools

Nordic provides specific tools to extract the exact MAC utilization and efficiency for custom models, so I do not need to calculate GOPS manually:
- Nordic Edge AI Lab (Released Jan 2026): This web-based tool allows you to build models (or use their Text-to-Wake-Word generator) and compile them directly for the Axon NPU. It provides estimated latency and footprint benchmarks before you flash the device.
- Edge AI Add-on for nRF Connect SDK (v2.0+): This includes the Axon Compiler. When you pass a .tflite model through it, it generates a profiling report showing exactly which layers were hardware-accelerated and the precise execution time in milliseconds. I will use this add-on for benchmarking and working with the Axon NPU.
- PC Simulator: Included in the SDK, this allows you to get performance and latency estimates for the NPU without needing the physical nRF54LM20B development kit. I will use this PC simulator to learn more, as it seems like a very interesting tool.
Whether I'm looking at a tiny 9-milliwatt Nordic Axon NPU or a 30-watt Intel Core NPU inside a flagship laptop, they all share the exact same optimization pattern: relying on INT8 (8-bit integer) math for AI inference.
4. Building the "MAC-Smasher" Benchmark Model

Note: I'm using macOS Tahoe 26.4.
You will need a standard Python environment with TensorFlow installed.
4.1 Setup Python Environment

Download Miniforge for macOS:

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh"

Then install it:

bash Miniforge3-MacOSX-arm64.sh

Follow the prompts and type "yes" to initialize it, then close and reopen your terminal.

Create an environment with a perfectly stable, slightly older Python (like 3.11):

conda create -n npu_env python=3.11 -y

Now activate it:

conda activate npu_env

Now install standard TensorFlow:

pip install tensorflow numpy

4.2 Generate the Strict INT8 TFLite Model

To test the NPU's absolute limit, I needed a pure math workload. I decided on a 256x256 Fully Connected (Dense) layer.
It requires exactly 65,536 MAC operations and produces exactly 64 KB of weights, small enough to fit entirely in the nRF54LM20B's ultra-fast internal SRAM without being bottlenecked by the main RAM.
For this purpose I wrote a short Python script called generate_benchmark_model.py:

mkdir <your-folder-with-diy-projects>/npu-model-gen-script
cd npu-model-gen-script
touch generate_benchmark_model.py

It uses TensorFlow/Keras to generate a strict INT8 quantized model, axon_mac_smasher_int8.tflite. Paste the content below into the generate_benchmark_model.py file.
import tensorflow as tf
import numpy as np
# 1. Define the exact math we want to perform
# Input: Array of 256 numbers.
# Dense Layer: 256 neurons.
# Total MACs = 256 (inputs) * 256 (neurons) = 65,536 MAC operations.
inputs = tf.keras.Input(shape=(256,))
# We set use_bias=False to keep the math purely Multiplication and Addition (MAC)
# We use 'relu' because Axon natively accelerates it in hardware
x = tf.keras.layers.Dense(256, use_bias=False, activation='relu')(inputs)
model = tf.keras.Model(inputs=inputs, outputs=x)
# 2. Create a "Representative Dataset"
# This is MANDATORY for INT8 quantization. TensorFlow needs dummy data
# to figure out how to scale the floating-point numbers into 8-bit integers.
def representative_data_gen():
    for _ in range(100):
        # Generate random float32 data between -1.0 and 1.0
        yield [np.random.uniform(-1.0, 1.0, size=(1, 256)).astype(np.float32)]
# 3. Configure the TFLite Converter for STRICT INT8
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force the math to be INT8 only (No float fallbacks!)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Force the Input and Output pins to be INT8
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# 4. Convert and Save
tflite_model = converter.convert()
with open('axon_mac_smasher_int8.tflite', 'wb') as f:
    f.write(tflite_model)
print("Model successfully generated: axon_mac_smasher_int8.tflite")
print("Total Expected MACs: 65,536")

The NPU only speaks 8-bit integer math, so forcing INT8 is crucial to prevent the NPU from rejecting the model and handing it back to the slower CPU.
By doing this, I control the exact mathematical size of the model, and force strict INT8 quantization so the Axon NPU accepts 100% of the operations without falling back to the CPU.
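What does "strict INT8" mean numerically? Below is a NumPy sketch of the quantized Dense layer the NPU actually executes: INT8 inputs times INT8 weights, accumulated in INT32, then requantized back to INT8. The output scale here is invented for illustration; the real one comes from the representative dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantized Dense layer: INT8 activations times INT8 weights.
x_q = rng.integers(-128, 128, size=(1, 256), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)

# The 65,536 multiplies accumulate in INT32, so they cannot overflow:
# |acc| <= 256 * 128 * 128 = 4,194,304, far below the INT32 limit.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)

# Requantize the INT32 accumulator back to INT8 with an output scale
# (1/1024 is a made-up scale; real models derive it during quantization).
y_q = np.clip(np.round(acc * (1.0 / 1024.0)), -128, 127).astype(np.int8)

print("accumulator dtype:", acc.dtype, "output dtype:", y_q.dtype)
```

This is why the benchmark is pure integer work end to end: no float math survives conversion.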
4.3 Run the Model Generation Script

Run the script:

(npu_env)% python3 generate_benchmark_model.py

It is not broken, just wait patiently. You will get output something like this:
Saved artifact at '/var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la'. The following endpoints are available:
Endpoint 'serve'
args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 256), dtype=tf.float32, name='keras_tensor')
Output Type:
TensorSpec(shape=(None, 256), dtype=tf.float32, name=None)
Captures:
4817651088: TensorSpec(shape=(), dtype=tf.resource, name=None)
/Users/maxxlife/miniforge3/envs/npu_env/lib/python3.11/site-packages/tensorflow/lite/python/convert.py:846: UserWarning: Statistics for quantized inputs were expected, but not specified; continuing anyway.
warnings.warn(
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1774728156.181161 453968 tf_tfl_flatbuffer_helpers.cc:364] Ignored output_format.
W0000 00:00:1774728156.181173 453968 tf_tfl_flatbuffer_helpers.cc:367] Ignored drop_control_dependency.
I0000 00:00:1774728156.181417 453968 reader.cc:83] Reading SavedModel from: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.181547 453968 reader.cc:52] Reading meta graph with tags { serve }
I0000 00:00:1774728156.181551 453968 reader.cc:147] Reading SavedModel debug info (if present) from: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.182224 453968 mlir_graph_optimization_pass.cc:437] MLIR V1 optimization pass is not enabled
I0000 00:00:1774728156.182334 453968 loader.cc:236] Restoring SavedModel bundle.
I0000 00:00:1774728156.187968 453968 loader.cc:220] Running initialization op on SavedModel bundle at path: /var/folders/0n/nfvd_scd21d7hldt9n17_bf00000gn/T/tmp8zemt7la
I0000 00:00:1774728156.189636 453968 loader.cc:471] SavedModel load for tags { serve }; Status: success: OK. Took 8222 microseconds.
I0000 00:00:1774728156.195318 453968 dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
fully_quantize: 0, inference_type: 6, input_inference_type: INT8, output_inference_type: INT8
W0000 00:00:1774728158.375938 453968 flatbuffer_export.cc:3851] Skipping runtime version metadata in the model. This will be generated by the exporter.
Model successfully generated: axon_mac_smasher_int8.tflite
Total Expected MACs: 65,536
The most important part is at the end:
Model successfully generated: axon_mac_smasher_int8.tflite
Total Expected MACs: 65,536

Notice this exact line in your output:

input_inference_type: INT8, output_inference_type: INT8

This confirms the model is strictly locked to 8-bit integer math for the Axon NPU.
Right now, that .tflite file is sitting on my Mac's hard drive. To get it onto the nRF54LM20B using Zephyr, I need to convert it into a C-array so the Axon compiler can bake it into the chip's Flash memory.
It is worth explaining a couple of important details from the Python code. I specifically chose 256 for both the inputs and the neurons (resulting in a 256x256 weight matrix) because of four strict hardware rules that matter when benchmarking an NPU.
# Total MACs = 256 (inputs) * 256 (neurons) = 65,536 MAC operations.

Here is why 256 is the "magic number" for this exact test:
1. It creates exactly 64 Kilobytes of Weights (The SRAM Sweet Spot)
In an INT8 model, every single weight (connection between neurons) takes exactly 1 byte of memory.
The nRF54LM20B has 512 KB of total RAM, but the Axon NPU has its own internal, ultra-fast local cache (Tightly Coupled Memory) so it doesn't have to wait for the main system bus. 64 KB is small enough to fit entirely inside this ultra-fast memory.
- If I chose 1024x1024 (1 Megabyte): It wouldn't fit in the chip.
- If I chose 512x512 (256 KB): It might force the NPU to fetch data from the slower main RAM, meaning you are benchmarking the memory speed, not the math speed. 64 KB guarantees we are purely testing the silicon math engine.
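The weight-size arithmetic for those three candidates, as a quick sanity check (nothing device-specific here, just INT8 weights at 1 byte each):

```python
def dense_weight_bytes(n_in, n_out):
    # INT8: every weight is exactly 1 byte
    return n_in * n_out

for n in (256, 512, 1024):
    kb = dense_weight_bytes(n, n) / 1024
    print(f"{n}x{n} Dense layer, INT8 weights: {kb:.0f} KB")
```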
2. Perfect Hardware Alignment (Powers of 2)
NPUs are designed with physical MAC arrays sized in powers of 2 (usually 16, 32, or 64 MACs working in parallel). If I give an NPU an odd size, like an input of 250, the hardware still has to load 256 bytes into its 32-byte memory lanes, filling the empty spaces with zeros (padding). This wastes clock cycles and skews the benchmark math.
Because 256 is perfectly divisible by 16, 32, 64, and 128, the NPU hardware operates at absolute 100% efficiency with zero wasted cycles.
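A quick sketch of that padding penalty, assuming hypothetical 32-byte memory lanes (the Axon's real lane width is not published):

```python
import math

def bytes_loaded(n_inputs, lane_width=32):
    # The hardware reads whole lanes, so odd sizes get rounded up with zero padding.
    return math.ceil(n_inputs / lane_width) * lane_width

for n in (250, 256):
    loaded = bytes_loaded(n)
    print(f"{n} inputs -> {loaded} bytes loaded, {loaded - n} wasted on padding")
```

At 250 inputs, 6 of the loaded bytes are pure padding; at 256, none are.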
3. Hiding the "Wake-Up" Overhead
When Zephyr RTOS tells the NPU to start, there is a slight delay (setup time, DMA configuration, triggering the hardware interrupt). This overhead might take a few dozen clock cycles.
- If I chose 16x16 (256 MACs): The NPU would finish the math so fast that my code's Zephyr timer would mostly just be measuring the OS overhead, giving me an artificially slow benchmark.
- At 256x256 (65,536 MACs): The math takes long enough that the tiny OS overhead becomes a statistical rounding error, giving me the true hardware speed.
4. Math
When I run the Zephyr benchmark and look at the total clock cycles, 65,536 is incredibly easy to divide in your head to find the physical MAC array size:

- If it takes ~2,000 cycles: 65,536 / ~2,000 points to a 32 MACs/cycle array.
- If it takes ~1,000 cycles: 65,536 / ~1,000 points to a 64 MACs/cycle array.
So, 256 is carefully chosen to perfectly flood the Axon NPU's data pipes. If any Axon engineers read this, I hope they will comment and confirm (or correct) my assumptions.
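That mental math as a one-liner; it deliberately ignores the fixed wake-up overhead, which the 65,536-MAC workload is sized to dwarf:

```python
TOTAL_MACS = 65_536

def implied_macs_per_cycle(measured_cycles):
    # Peak MACs/cycle implied by a measured cycle count, ignoring setup overhead.
    return TOTAL_MACS / measured_cycles

for cycles in (2048, 1024, 512):
    print(f"{cycles} cycles -> {implied_macs_per_cycle(cycles):.0f} MACs/cycle")
```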
Later we will run this .tflite model through the Nordic Axon compiler.
Connect the development kit to the Debugger USB port and move the power switch to the ON position.
Download and install it from here:
Open Board Configurator.
Select device.
Select connected board.
Now, you can review the entire board configuration.
Here is my board version.
Now, close the Board Configurator and return to the Apps window. Open Toolchain Manager.
Here is the list of available toolchains. You will need to install nRF Connect SDK v3.3.0-preview3 using the CLI. Clicking the Install with CLI button will redirect you to the nRF documentation; however, since their documentation can be a nightmare, I have provided a shortcut below in 5.3.
Follow these commands to avoid digging through their pages.
5.3 Verify the nrfutil Installation

Enter the following command:

nrfutil list

Review the output:
Command Version Description
completion 1.5.0
sdk-manager 1.11.0 Install and manage SDK and toolchain installations for the nRF Connect SDK.
Found 2 installed command(s)

If you don't have nrfutil, download and install the appropriate version by following this link: https://www.nordicsemi.com/Products/Development-tools/nRF-Util/Download?lang=en#infotabs
Enter the following command:
nrfutil install sdk-manager

Review the output:
nrfutil-sdk-manager already installed - use '--force' to uninstall and reinstall the command
[00:00:00] ###### 100% [Install packages] Install packages

5.5 Install the Required nRF Connect SDK Preview

Enter the following command:
nrfutil sdk-manager install v3.3.0-preview3

Review the output:
Using the global server for downloads. Use '--region cn' for the server in Mainland China.
[00:00:02] ------ 0% [Download toolchain v3.3.0-preview3] Downloading toolchain

SDK installation may take some time, so please be patient and carefully review the output for any errors. It might take up to an hour to finish.
% nrfutil sdk-manager install v3.3.0-preview3
Using the global server for downloads. Use '--region cn' for the server in Mainland China.
[00:16:01] ###### 100% [Download toolchain v3.3.0-preview3] Toolchain downloaded
[00:00:18] ###### 100% [Unpack toolchain v3.3.0-preview3] Toolchain unpacked to /opt/nordic/ncs/tmp/.tmpMLCL43
[00:00:01] ###### 100% [Install toolchain v3.3.0-preview3] Toolchain installed at /opt/nordic/ncs/toolchains/e023efa4ae
[00:55:30] ###### 100% [Download SDK v3.3.0-preview3] Download SDK v3.3.0-preview3
[00:00:06] ###### 100% [Calculating SDK checksum] Verified download
[00:00:33] ###### 100% [Unpack SDK v3.3.0-preview3] Unpacked SDK tarball

5.6 Verify the SDK Installation

Once complete, the latest toolchain will be installed on your macOS at:

/opt/nordic/ncs/toolchains/<toolchain-hash-number-version>

The latest SDK will be installed at:

/opt/nordic/ncs/<sdk-number-version>

To perform a final check and verify your successful installations, run the following command:

% nrfutil sdk-manager list

Review the output:
SDK Type SDK Version SDK Status Toolchain Status
nrf v3.3.0-preview3 Installed Installed

6. Compiling the Model for the Axon NPU

The Axon NPU cannot read a standard TensorFlow .tflite array; it only speaks its own proprietary "hardware machine code." To bridge that gap, Nordic engineers created the Edge AI Add-on.
Nordic built the compiler into the Zephyr build system. When you hit "Build", Zephyr intercepts your .tflite file, translates the TensorFlow math into Axon hardware instructions, and generates its own special .h file containing NPU-specific memory pointers (not just raw bytes).
You have your axon_mac_smasher_int8.tflite file. Instead of using xxd to turn it into a flat C-array, you must run it through the Axon NPU Compiler. This compiler is included in the Edge AI Add-on you just installed.
6.1 Create an Edge AI Workspace

Run the following commands to set up the folder:

cd <your-folder-with-diy-projects>
mkdir axon-edge-ai
cd axon-edge-ai

Launch the installed toolchain:

nrfutil sdk-manager toolchain launch --ncs-version v3.3.0-preview3 --shell

Review the output:
Initializing shell environment!
(v3.3.0-preview3) macbook-maxxlife%

6.2 Set ZEPHYR_BASE

Export the correct environment variable ZEPHYR_BASE:

export ZEPHYR_BASE=<your-folder-with-diy-projects>/axon-edge-ai

For example, I have:

export ZEPHYR_BASE=/Users/maxxlife/github_repos/axon-edge-ai

6.3 Initialize the Workspace with west

Now initialize the Zephyr repo with west:

west init -m https://github.com/nrfconnect/sdk-edge-ai

Update all repositories by running the following command:

west update

Review the output:
=== updating zephyr (zephyr):
--- zephyr: initializing
Initialized empty Git repository in /Users/maxxlife/github_repos/axon-edge-ai/zephyr/.git/
--- zephyr: fetching, need revision 4b6df5ff11b1a10a2ffa89dab7450c0af98c9e3a
remote: Enumerating objects: 1510594, done.
remote: Counting objects: 100% (411/411), done.
remote: Compressing objects: 100% (245/245), done.
Receiving objects: 9% (135954/1510594), 40.94 MiB | 4.59 MiB/s

Now West will begin the workspace initialization. This process may take some time, typically around 15-20 minutes. You can monitor the workspace setup progress and view specific details directly in the terminal.
Now, exit the nrfutil shell by simply typing exit:

(v3.3.0-preview3) macbook-maxxlife% exit

to get back to the Conda environment:

(npu_env)

6.4 Activate the Conda Virtual Environment

Since you are using the nRF Connect SDK, you have the translator built in. Make sure that you are still in the activated Conda virtual environment; if not, run:

conda activate npu_env

6.5 Install the Compiler Requirements

After west update, check the files in the folder using ls:
(v3.3.0-preview3) macbook-maxxlife% ls
bootloader edge-ai modules nrf nrfxlib test tools zephyr

Change into the compiler directory:

cd edge-ai/tools/axon/compiler

Now that your Conda environment is active, install the extra libraries Nordic's compiler script needs:

pip install -r scripts/requirements.txt

Make sure you are still in your terminal inside the edge-ai/tools/axon/compiler directory:

cd <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/

We are going to create a brand new, minimal file called config.yaml.
Run this Python command in your terminal to generate the config.yaml file with the exact configuration:
Note: Ensure that you replace the tflite_model path with the correct one for your local system!
python -c '
import yaml
data = {
"mac_smasher_benchmark": {
"model_name": "axon_mac_smasher_int8",
"tflite_model": "/Users/maxxlife/github_repos/npu-model-gen-script/axon_mac_smasher_int8.tflite",
"interlayer_buffer_size": 70000,
"log_level": "info"
}
}
with open("config.yaml", "w") as f:
yaml.dump(data, f)
print("config.yaml generated perfectly!")
'

6.6 Run the Axon Compiler

Enter the following command:

python scripts/axons_ml_nn_compiler_executor.py config.yaml

6.7 Check the Compiler Output

INFO: running mac_smasher_benchmark
INFO: starting compiler...
INFO: model name is : axon_mac_smasher_int8
INFO: test data or proper test labels are not provided, we have the tflite file.
INFO: running for variant axon_mac_smasher_int8
/Users/maxxlife/miniforge3/envs/npu_env/lib/python3.11/site-packages/cffi/cparser.py:436: UserWarning: #pragma in cdef() are entirely ignored. They should be removed for now, otherwise your code might behave differently in a future version of CFFI if #pragma support gets added. Note that '#pragma pack' needs to be replaced with the 'packed' keyword argument to cdef().
warnings.warn(
INFO: compiler intermediate outputs are saved in : /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/intermediate/
executing compiler object at /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/scripts/../bin/Darwin/libnrf-axon-nn-compiler-lib-arm64.dylib
open Compiler I/O files:
input model desc file: intermediate/nrf_axon_model_axon_mac_smasher_int8_bin_.bin
binary file version: 0x00001100
fully connected 1 x 256 into 256 x 256...done.
command_buf size=6632
model constants buffer size=66560
interlayer buffer size: provisioned 70000, needed 1280
Psum buffer size: provisioned 0, needed 0
Persistent Vars buffer size needed 0
Compiler generating model.h file
run_c_compiler_lib-INFO: model compiled successfully ....
run_c_compiler_lib-INFO: not running inference as test vectors are not provided
done executing compiler object at /Users/maxxlife/github_repos/axon-edge-ai/edge-ai/tools/axon/compiler/scripts/../bin/Darwin/libnrf-axon-nn-compiler-lib-arm64.dylib
INFO: done running variant axon_mac_smasher_int8, return code 0
INFO:
Performance_Metrics
Model_Variant Inference_Time Accuracy Precision Quantization_Loss Flash_Size Ram_Size Total_Memory
axon_mac_smasher_int8 None None None None 0 0 0
INFO: completed running mac_smasher_benchmark, return code 0(NRF_AXON_COMPILER_RESULT_SUCCESS), took 0.88 seconds

6.8 Find the Generated Header File

The .h file will be located in the outputs folder:

<your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/outputs/nrf_axon_model_axon_mac_smasher_int8_.h

Later, you will need to copy this .h file from the output directory to your project's /src folder.
Ensure that you have the nRF Connect Extension installed in VS Code from Extensions Marketplace.
Open the nRF Connect Extension and verify that your SDKs and Toolchains are correctly configured.
Click the Manage Toolchains tab.
Navigate to this section and click Validate Toolchain to ensure everything is set up correctly.
Select the specific toolchain version installed for this tutorial. Since I am working on multiple projects, I have several toolchains installed, but you should choose the one we just configured.
After selecting the toolchain to validate, the results will appear in the bottom-right corner of VS Code.
Follow the same steps to validate the SDK.
7.4 Open the hello_axon Application

First, let's build the default hello_axon application to verify that everything is configured correctly.
Select an existing application folder.

Then choose the path to the samples folder inside the workspace you created previously. In my case the path is:

/Users/maxxlife/github-repos/axon-edge-ai/edge-ai/samples/axon/hello-axon

In your case it will be something like:

<your-folder-with-diy-projects>/axon-edge-ai/edge-ai/samples/axon/hello-axon

Click Open to load the project workspace.
You will be automatically redirected to the VS Code Explorer view.
Now go back and open the nRF Connect view to build the project.
Once you see the hello_axon project added to the list, you can proceed to configure and build it.
Click Add build configuration, then ensure you select the correct SDK and Toolchain versions from the dropdown menus.
Finally, click the Generate and Build button to generate the build files and compile the application.
Click Flash to run west flash and upload program to the board.
To view the application output, you must have the serial port open.
Choose VCOM1
Select connected device.
Now, press the physical Reset button on your hardware to restart the program and view the output in the terminal.
Since the hello_axon sample is running successfully, we can use it as a foundation for our new benchmark application.
Now you need to create a Zephyr benchmark application.

8.1 Prepare the Benchmark Application

The best approach is to build your new application directly from the hello_axon sample. This ensures all environment variables and dependencies remain intact.
I recommend copying the folder hello_axon and renaming it to axon_benchmark to keep your workspace organized.
Open the CMakeLists.txt file in your new axon_benchmark folder and update the project name.
# Zephyr CMake project
find_package(Zephyr REQUIRED HINTS $ENV{ZEPHYR_BASE})
project(axon_benchmark)

8.2 Zephyr project main.c benchmark code

Update the main.c file with this code.
#include <drivers/axon/nrf_axon_driver.h>
#include <drivers/axon/nrf_axon_nn_infer.h>
#include <axon/nrf_axon_platform.h>
#include <zephyr/logging/log.h>
#include <zephyr/timing/timing.h>
// Include the generated file (with the trailing underscore!)
#include "nrf_axon_model_axon_mac_smasher_int8_.h"
LOG_MODULE_REGISTER(axon_benchmark);
int main(void)
{
    nrf_axon_result_e result;
    LOG_INF("--- Axon NPU MAC Benchmark ---");
    // 1. Initialize the NPU platform
    if (nrf_axon_platform_init() != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Platform init failed");
        return -1;
    }
    // 2. Initialize model variables
    if (nrf_axon_nn_model_init_vars(&model_axon_mac_smasher_int8) != 0) {
        LOG_ERR("Model var init failed");
        return -1;
    }
    // 3. Validate the model
    if (nrf_axon_nn_model_validate(&model_axon_mac_smasher_int8) != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Model validation failed!");
        return -1;
    }
    // 4. Prepare input data (fill the 256-byte input array with dummy data)
    int8_t *input_buffer = model_axon_mac_smasher_int8.inputs[0].ptr;
    for (int i = 0; i < 256; i++) {
        input_buffer[i] = 1;
    }
    int8_t output_buffer[256];
    LOG_INF("Model Loaded. Expected MACs: 65,536");
    LOG_INF("Starting 10-pass benchmark...");
    // 5. Set up the high-resolution timer
    timing_init();
    timing_start();
    uint64_t total_cycles = 0;
    const int num_runs = 10;
    // 6. Run the benchmark
    for (int i = 0; i < num_runs; i++) {
        timing_t start_time = timing_counter_get();
        // BOOM! Fire the NPU!
        result = nrf_axon_nn_model_infer_sync(&model_axon_mac_smasher_int8, input_buffer, output_buffer);
        timing_t end_time = timing_counter_get();
        if (result != NRF_AXON_RESULT_SUCCESS) {
            LOG_ERR("Inference failed on run %d", i);
            return -1;
        }
        total_cycles += timing_cycles_get(&start_time, &end_time);
    }
    timing_stop();
    // 7. Calculate results
    uint64_t average_cycles = total_cycles / num_runs;
    // Multiply by 100 to print two decimal places cleanly
    uint64_t macs_x_100 = (65536 * 100) / average_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", macs_x_100 / 100, macs_x_100 % 100);
    LOG_INF("------------------------------");
    return 0;
}

8.3 Copy the compiled .h file

Copy the compiled nrf_axon_model_axon_mac_smasher_int8_.h file created in 6.8 to the /src folder so that main.c can find the header.
cp <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/tools/axon/compiler/outputs/nrf_axon_model_axon_mac_smasher_int8_.h <your-folder-with-diy-projects>/axon-edge-ai/edge-ai/samples/axon/axon_benchmark

8.4 Update project configuration

CONFIG_NCS_SAMPLES_DEFAULTS=y
# Enable NPU
CONFIG_NRF_AXON=y
# Set buffer to what the compiler requested
CONFIG_NRF_AXON_INTERLAYER_BUFFER_SIZE=1280
CONFIG_NRF_AXON_PSUM_BUFFER_SIZE=0
# Enable Timing APIs for benchmarking
CONFIG_TIMING_FUNCTIONS=y
CONFIG_PICOLIBC_IO_FLOAT=y

8.5 Set up the build configuration

Use the same build configuration as the "Hello World" sample.
Flash the compiled application to your nRF54LM20B dev kit.
8.7 Open a serial port connection

Open the serial port again and view the application output:
*** Booting nRF Connect SDK v3.3.0-preview2-ede152ec210b ***
*** Using Zephyr OS v4.3.99-4b6df5ff11b1 ***
I: --- Axon NPU MAC Benchmark ---
I: Model Loaded. Expected MACs: 65,536
I: Starting 10-pass benchmark...
I: Average Cycles per Inference: 16110
I: Calculated Hardware MACs/Cycle: 4.06

I flashed the board, opened the serial terminal, and saw the result: 16,110 average cycles, which works out to 4.06 MACs per cycle.
My Conclusion: Why exactly ~4? I think it's because the NPU hit the Memory Wall. The internal memory bus on the Cortex-M33 architecture is 32 bits (4 bytes) wide. Since the Dense layer requires 1 byte per weight, the NPU was physically limited to pulling 4 weights out of memory per clock cycle.
But it is just my assumption, and I need Nordic engineers to help figure that out.
The Takeaway: I think the benchmark showed that the Axon NPU's Direct Memory Access (DMA) and pipelining are so well optimized that it crunches data at essentially 100% of the physical speed limit of the copper traces on the TSMC 22nm silicon.
If I ran this exact same 256x256 benchmark on the standard 128 MHz Cortex-M33 CPU using C-code, it wouldn't get 4 MACs per cycle.
- The CPU has to load the instruction, load the data pointer, fetch the data, increment the pointer, multiply, add, and store.
- A highly optimized CPU usually achieves about 0.3 to 0.5 MACs per cycle on Dense layers.
- By hitting a sustained 4.06, Axon NPU just proved Nordic's marketing claim: It is ~10x to 15x faster than the CPU!
Let's slightly update our code to see how much faster the Axon NPU handles this task compared to the Cortex CPU. Replace the existing code with the following:
#include <drivers/axon/nrf_axon_driver.h>
#include <drivers/axon/nrf_axon_nn_infer.h>
#include <axon/nrf_axon_platform.h>
#include <zephyr/logging/log.h>
#include <zephyr/timing/timing.h>
// Include the generated file (with the trailing underscore!)
#include "nrf_axon_model_axon_mac_smasher_int8_.h"
LOG_MODULE_REGISTER(axon_benchmark);
// We want O3 so the CPU runs at maximum physical speed
__attribute__((noinline, optimize("O3")))
void run_cpu_dense_layer(const int8_t *input, const int8_t *weights, int32_t *output)
{
    for (int out_idx = 0; out_idx < 256; out_idx++) {
        int32_t accumulator = 0;
        for (int in_idx = 0; in_idx < 256; in_idx++) {
            accumulator += input[in_idx] * weights[out_idx * 256 + in_idx];
        }
        output[out_idx] = accumulator;
    }
}

int main(void)
{
    nrf_axon_result_e result;
    LOG_INF("--- Axon NPU MAC Benchmark ---");
    // 1. Initialize the NPU platform
    if (nrf_axon_platform_init() != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Platform init failed");
        return -1;
    }
    // 2. Initialize model variables
    if (nrf_axon_nn_model_init_vars(&model_axon_mac_smasher_int8) != 0) {
        LOG_ERR("Model var init failed");
        return -1;
    }
    // 3. Validate the model
    if (nrf_axon_nn_model_validate(&model_axon_mac_smasher_int8) != NRF_AXON_RESULT_SUCCESS) {
        LOG_ERR("Model validation failed!");
        return -1;
    }
    // 4. Prepare input data (fill the 256-byte input array with dummy data)
    int8_t *input_buffer = model_axon_mac_smasher_int8.inputs[0].ptr;
    for (int i = 0; i < 256; i++) {
        input_buffer[i] = 1;
    }
    int8_t output_buffer[256];
    LOG_INF("Model Loaded. Expected MACs: 65,536");
    LOG_INF("Starting 10-pass benchmark...");
    // 5. Set up the high-resolution timer
    timing_init();
    timing_start();
    uint64_t total_cycles = 0;
    const int num_runs = 10;
    // 6. Run the benchmark
    for (int i = 0; i < num_runs; i++) {
        timing_t start_time = timing_counter_get();
        // BOOM! Fire the NPU!
        result = nrf_axon_nn_model_infer_sync(&model_axon_mac_smasher_int8, input_buffer, output_buffer);
        timing_t end_time = timing_counter_get();
        if (result != NRF_AXON_RESULT_SUCCESS) {
            LOG_ERR("Inference failed on run %d", i);
            return -1;
        }
        total_cycles += timing_cycles_get(&start_time, &end_time);
    }
    timing_stop();
    // 7. Calculate results
    uint64_t average_cycles = total_cycles / num_runs;
    // Multiply by 100 to print two decimal places cleanly
    uint64_t macs_x_100 = (65536 * 100) / average_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", macs_x_100 / 100, macs_x_100 % 100);
    LOG_INF("------------------------------");
    LOG_INF("==============================");
    LOG_INF("--- Cortex-M33 CPU Benchmark ---");
    timing_start(); // Restart the timer we stopped after the NPU run!
    static int8_t cpu_input[256];
    static int8_t cpu_weights[65536];
    static int32_t cpu_output[256];
    // Call through a volatile function pointer so the compiler cannot
    // inline or constant-fold the dense layer away
    void (*volatile blind_cpu_func)(const int8_t *, const int8_t *, int32_t *) = run_cpu_dense_layer;
    // Fill weights and inputs with unpredictable data from the cycle counter
    for (int i = 0; i < 65536; i++) {
        cpu_weights[i] = (int8_t)(k_cycle_get_32() & 0xFF);
    }
    for (int i = 0; i < 256; i++) {
        cpu_input[i] = (int8_t)(k_cycle_get_32() & 0xFF);
    }
    LOG_INF("CPU Model Loaded. Expected MACs: 65,536");
    uint64_t total_cpu_cycles = 0;
    int32_t dynamic_checksum = 0;
    for (int i = 0; i < num_runs; i++) {
        // Mutate the input so every run is unique
        cpu_input[0] = (int8_t)(k_cycle_get_32() & 0xFF);
        timing_t start_time = timing_counter_get();
        // Execute the blind jump! The compiler is helpless here.
        blind_cpu_func(cpu_input, cpu_weights, cpu_output);
        timing_t end_time = timing_counter_get();
        total_cpu_cycles += timing_cycles_get(&start_time, &end_time);
        // Consume an output so the result is observably used
        dynamic_checksum += cpu_output[i % 256];
    }
    uint64_t average_cpu_cycles = total_cpu_cycles / num_runs;
    if (average_cpu_cycles == 0) {
        average_cpu_cycles = 1;
    }
    uint64_t cpu_macs_x_100 = (65536 * 100) / average_cpu_cycles;
    LOG_INF("Average Cycles per Inference: %llu", average_cpu_cycles);
    LOG_INF("Calculated Hardware MACs/Cycle: %llu.%02llu", cpu_macs_x_100 / 100, cpu_macs_x_100 % 100);
    LOG_INF("------------------------------");
    uint64_t speedup_x_100 = (average_cpu_cycles * 100) / average_cycles;
    LOG_INF(">>> AXON NPU SPEEDUP: %llu.%02llux FASTER <<<", speedup_x_100 / 100, speedup_x_100 % 100);
    LOG_INF("==============================");
    LOG_INF("CPU Math Checksum: %d", dynamic_checksum);
    return 0;
}
Build and run the project again.
In the serial output, you’ll see the performance gap between the Axon NPU and the Cortex CPU while performing the same task. The difference is impressive.
9.2 Analyze the results

Your serial output should look something like this:
*** Booting nRF Connect SDK v3.3.0-preview2-ede152ec210b ***
*** Using Zephyr OS v4.3.99-4b6df5ff11b1 ***
I: --- Axon NPU MAC Benchmark ---
I: Model Loaded. Expected MACs: 65,536
I: Starting 10-pass benchmark...
I: Average Cycles per Inference: 15975
I: Calculated Hardware MACs/Cycle: 4.10
I: ------------------------------
I: ==============================
I: --- Cortex-M33 CPU Benchmark ---
I: CPU Model Loaded. Expected MACs: 65,536
I: Average Cycles per Inference: 395748
I: Calculated Hardware MACs/Cycle: 0.16
I: ------------------------------
I: >>> AXON NPU SPEEDUP: 24.77x FASTER <<<
I: ==============================
I: CPU Math Checksum: 12412

Look at that magnificent block of text. 24.77x FASTER!
I think I have extracted the raw, undeniable physical truth of the nRF54LM20B silicon.
Let's break down the numbers:
1. The Cortex-M33 Reality (0.16 MACs/Cycle)
The CPU took 395,748 cycles to do 65,536 operations.
If you divide that out (395,748 / 65,536 ≈ 6.04), the CPU took about 6 full clock cycles for each single MAC operation.
Why? I guess because for every single multiplication, the CPU had to:
- Calculate the memory address.
- Load the input byte over the bus.
- Load the weight byte over the bus.
- Multiply them.
- Add to the accumulator.
- Increment the loop counter and check if it should branch.
2. The Axon NPU Reality (4.06-4.10 MACs/Cycle)
The NPU took around 16k cycles.
I’m still not sure why there is always a slight difference in the cycle count—sometimes it's 16,110, and other times it's a bit more or less. AI suggests it's due to dynamic clock scaling, cache hits and misses, pipeline stalls, branch prediction, background interrupts, and bus contention. It’s likely right.
The NPU didn't waste a single cycle on loop counters, branches, or instruction fetches. It locked the 32-bit memory bus open and slammed 4 weights per clock cycle directly into its hardware math array, probably hitting the absolute physical "speed of light" allowed by the copper memory traces on the chip.
The Conclusion: Around 24x-25x Faster

Nordic's marketing team claims the NPU is "up to 15x faster" because they benchmark it using full, complex models (like Audio Keyword Spotting) that have some CPU fallback overhead.
But I just built a pure, unadulterated bare-metal micro-benchmark. Stripped of all OS overhead, the raw silicon of the Axon NPU is actually nearly 25 times faster than the Cortex-M33 at crunching Dense AI layers!
- The Goal: Benchmarking the new unreleased Axon NPU.
- The Discovery: Finding out Nordic deleted TFLM and forces developers to use hardware machine code.
- The Hack: Digging into the hidden Python compiler, feeding it a custom config.yaml, and extracting the raw .h file.
- The Trap: Fighting ARM TrustZone (/ns vs Secure), GCC Compiler optimizations deleting the code, and a broken stopwatch. Yes, I didn't cover this stuff; you're just seeing the final working code without the backlog of my trials and errors.
- The Climax: The side-by-side execution revealing the "Memory Wall" (4.06-4.10 limit) and the incredible 24-25x Speedup.
My Key Learnings About Benchmark Design
The "Red Line" of Benchmarking: Strict Isolation
The golden rule of benchmarking is Isolation.
You must measure only the specific hardware or software you are trying to test, and absolutely nothing else. If you do not perfectly isolate the target, you will end up measuring "noise."
In this journey, I had to systematically strip away the noise:
- Software Noise: I threw away the TensorFlow Lite C++ interpreter because it has software overhead. I went "bare-metal" to measure the raw hardware.
- Compiler Noise: Compilers (like GCC) are designed to "cheat" by skipping math if they know the answer ahead of time. I had to use unpredictable hardware clocks (entropy) and blind function pointers to isolate the silicon's actual execution from the compiler's shortcuts.
- Measurement Noise: Using an OS-level timer (like a 32kHz sleep timer) adds milliseconds of noise. I had to use the chip's internal 128 MHz cycle counter to measure the exact nanosecond the math started and stopped.
Core Architecture: The Two Limits (Compute vs. Memory)
To understand what your benchmark is telling you, you have to understand the fundamental architecture of modern computers (the Von Neumann architecture).
Every benchmark you run will hit one of two physical walls:
The "Compute Wall" (Compute-Bound)
The physical limit of how fast the math units (ALUs or MAC arrays) can crunch numbers.
- When you hit it: When the data is already sitting inside the processor's internal registers. The math engine is running at 100% capacity, and the memory bus is waiting for it to finish.
Example: Calculating the digits of Pi, or running a Convolutional layer (where a tiny 9-byte image filter is reused millions of times).
The "Memory Wall" (Memory-Bound)
The physical limit of how fast the copper wires (the memory bus) can move data from the RAM into the math units.
- When you hit it: When the processor is so fast that it finishes the math instantly and has to sit idle waiting for the next byte of data to arrive from RAM.
Example: Our 65,536-MAC Dense Layer! The NPU math array could theoretically do 32 or 64 multiplications per cycle, but the 32-bit (4-byte) memory bus could only deliver 4 weights per cycle. The NPU hit the Memory Wall, resulting in 4.10 MACs/cycle.
If you want to benchmark any chip in the future, follow my playbook:
1. Define the Workload (Apples to Apples)

You must force both systems to do the exact same amount of physical work. I forced the Axon NPU and the Cortex-M33 to both calculate exactly 65,536 Multiply-Accumulate operations.
2. Ensure Unpredictability (Defeat the Cheats)

Never benchmark with arrays full of 0s or 1s. If the input data is predictable, the CPU will cache it, or the compiler will solve it during the build process. Always feed benchmarks with random, dynamic data so the hardware is forced to sweat.
3. Align the Timers

Always ensure you are measuring with the same stopwatch. In the very beginning I timed the NPU with the 128 MHz clock and the CPU with the 32 kHz clock, which gave totally wrong results. Time the code as close to the hardware registers as possible, avoiding high-level OS functions like printf or sleep inside the timed loop.
4. Calculate the "Per-Cycle" Truth

Never just look at "Time Elapsed in milliseconds." A chip running at 1 GHz will always finish faster in milliseconds than a chip running at 100 MHz, even if its architecture is worse. By dividing the total operations by the total clock cycles, you discover the true architectural efficiency of the silicon, completely independent of its clock speed.
Nordic Semiconductor Official Sources

Press Release (Sept 2023): "Nordic Semiconductor expands AI/ML strategy by acquiring Atlazo's IP and world-class specialists" Source Link
nRF54L20 Product Brief: Source Link
Nordic DevZone - "Introduction to nRF54L Series": Source Link
Tirias Research - "AI at the Edge": Research Reference
Edge Impulse Integration: Edge Impulse Blog on nRF549