Raghav_2028 • yashpal • Tamilarasan Ravi

Published December 30, 2025

AMD Ryzen AI NPU Custom Kernel Implementation

This session focuses on running examples that demonstrate streaming data into a basic math kernel on the NPU.

Beginner • Full instructions provided • 4 hours

Things used in this project

Hardware components

AMD Ryzen 7 8700G WIFI w/ Radeon 780M Graphics

Software apps and online services

Ubuntu

Story

Overview

This tutorial demonstrates basic data streaming into a kernel that performs a simple math operation on the AMD Ryzen AI Phoenix NPU, using the AIE dialects.

Requirements

AMD Ryzen AI Phoenix
Linux® based development environment
Python® (for test automation and result validation)
IRON API and MLIR-based AI Engine Toolchain

Project Brief

The SoC is designed to accelerate AIE-ML algorithms and deliver exceptional performance. The NPU complex has:

16 AI Cores for computation
4 Memory Tiles for fast memory access
4 SHIM DMAs to move data in and out of L3 memory

Note: This project is customized for Phoenix.

Features covered in this session

One SHIM DMA pass through to a data kernel
Two SHIM DMA pass through to a data kernel
Four SHIM DMA pass through to a data kernel

One SHIM DMA pass through

In this section, SHIM DMA (0, 0), MEM Tile (0, 1) and Core (0, 2) of column 0 are used. A predefined set of data stored in L3 memory is streamed into the NPU complex. Data is routed from the SHIM DMA to the Core via MEM Tile memory. A linear math operation is performed on the data, and the result is routed back. The received output stream is captured and compared with the reference.

Two SHIM DMA pass through

In this section, SHIM DMAs ((0, 0), (1, 0)), MEM Tiles ((0, 1), (1, 1)) and Cores ((0, 2), (1, 2)) of columns 0 and 1 are used. A predefined set of data stored in L3 memory is streamed into the NPU complex. Data is routed from each SHIM DMA to its Core via MEM Tile memory. A linear math operation is performed on the data, and the result is routed back. The received output stream is captured and compared with the reference.

Four SHIM DMA pass through

In this section, SHIM DMAs ((0, 0), (1, 0), (2, 0), (3, 0)), MEM Tiles ((0, 1), (1, 1), (2, 1), (3, 1)) and Cores ((0, 2), (1, 2), (2, 2), (3, 2)) of columns 0, 1, 2 and 3 are used. A predefined set of data stored in L3 memory is streamed into the NPU complex. Data is routed from each SHIM DMA to its Core via MEM Tile memory. A linear math operation is performed on the data, and the result is routed back. The received output stream is captured and compared with the reference.
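
The three configurations above follow one pattern: for each column used, the SHIM DMA sits in row 0, the MEM Tile in row 1 and the Core in row 2. The following is a minimal Python sketch of that pattern, not the project's actual design code. The function names `tiles_for_columns` and `split_evenly` are hypothetical, and the even, contiguous split of the input across columns is an assumption, since the text does not say how the data is partitioned.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (column, row)

# Row roles as used in the sections above:
# row 0 = SHIM DMA, row 1 = MEM Tile, row 2 = Core.
SHIM_ROW, MEM_ROW, CORE_ROW = 0, 1, 2


def tiles_for_columns(num_columns: int) -> Dict[str, List[Tile]]:
    """Return the SHIM DMA, MEM Tile and Core coordinates for columns 0..num_columns-1."""
    if num_columns not in (1, 2, 4):
        raise ValueError("this session covers 1, 2 or 4 columns")
    cols = range(num_columns)
    return {
        "shim": [(c, SHIM_ROW) for c in cols],
        "mem": [(c, MEM_ROW) for c in cols],
        "core": [(c, CORE_ROW) for c in cols],
    }


def split_evenly(data: List[int], num_columns: int) -> List[List[int]]:
    """Split the input into equal contiguous chunks, one per column.

    Assumption: the real designs may partition the L3 buffer differently.
    """
    if len(data) % num_columns != 0:
        raise ValueError("data length must be divisible by the number of columns")
    chunk = len(data) // num_columns
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_columns)]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, tiles_for_columns(n))
    print(split_evenly(list(range(8)), 2))
```

Calling `tiles_for_columns(2)` lists the same six tiles named in the two SHIM DMA section, which makes it easy to check a placement against the text.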

Linear Operation

A linear equation describes a straight line, most commonly written in slope-intercept form: y = mx + c

m: The slope.
c: The y-intercept (the point where the line crosses the y-axis).
x, y: Variables representing any point on the line.
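
To make the output comparison concrete, here is a minimal host-side sketch of a reference model for y = mx + c, with m as the slope and c as the intercept, applied element by element to the input stream. It is an illustration under assumptions rather than the project's test code: the function names are hypothetical, integer data is assumed, and the values m = 3 and c = 5 are arbitrary examples.

```python
from typing import List, Sequence


def linear_reference(x: Sequence[int], m: int, c: int) -> List[int]:
    """Compute y = m*x + c for every element of the input stream."""
    return [m * value + c for value in x]


def matches_reference(output: Sequence[int], x: Sequence[int], m: int, c: int) -> bool:
    """Compare a captured output stream with the reference, element by element."""
    return list(output) == linear_reference(x, m, c)


if __name__ == "__main__":
    inputs = [0, 1, 2, 3]
    # Example values only; the slope and intercept used by the kernel are not stated here.
    expected = linear_reference(inputs, m=3, c=5)
    print(expected)  # [5, 8, 11, 14]
    print(matches_reference(expected, inputs, m=3, c=5))  # True
```

In a real run, the `output` argument would be the stream captured from the NPU rather than the reference itself.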

DataFlow

Figure: Kernel Data Flow

How to Install and Setup the Environment

amd-ryzen-ai-npu-tool-chain-installation-and-execution

Build and Run

Navigate to one of the test case folders and build the AIE design using the make command:

env use_placed=1 make

After a successful build, compile and execute the host application to run the design on the Ryzen AI NPU:

make run NPU1=1

This triggers the MLIR-AIE runtime, which offloads the computation to the NPU. The accelerated results appear as soon as the run completes.
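
For test automation, the two steps above could be driven from Python. The sketch below is one possible wrapper, not part of the toolchain: `build_and_run` and its `dry_run` flag are hypothetical names. It only reissues the two commands shown in this section, passing `use_placed=1` through the environment and `NPU1=1` as a make argument.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_and_run_commands() -> List[Tuple[List[str], Dict[str, str]]]:
    """Return the two commands from this section with their extra environment variables."""
    return [
        (["make"], {"use_placed": "1"}),   # env use_placed=1 make
        (["make", "run", "NPU1=1"], {}),   # make run NPU1=1
    ]


def build_and_run(testcase_dir: str, dry_run: bool = False) -> List[str]:
    """Build the design, then run it; with dry_run=True only report the commands."""
    executed = []
    for argv, extra_env in build_and_run_commands():
        executed.append(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if make fails, so a broken
            # build stops the sequence before the run step.
            subprocess.run(
                argv,
                cwd=testcase_dir,
                env={**os.environ, **extra_env},
                check=True,
            )
    return executed


if __name__ == "__main__":
    # Dry run from the current directory; point testcase_dir at a test case folder to execute.
    print(build_and_run(".", dry_run=True))
```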

Conclusion

These tutorials demonstrate how to use the “IRON API and MLIR-based AI Engine Toolchain” to pass data through a math operation on the NPU. Future tutorials will extend this work to optimize the kernel and the data throughput, with one or many SHIM DMAs operating in parallel.
