Raghav_2028 • yashpal • Tamilarasan Ravi

Published December 18, 2025

AMD Ryzen AI NPU Data movement between L3 and AIE array

This session runs examples that demonstrate SHIM DMA data movement from L3 memory to tile memory.

Beginner • Full instructions provided • 4 hours

Things used in this project

Hardware components

AMD Ryzen 7 8700G WIFI w/ Radeon 780M Graphics

Software apps and online services

Ubuntu

Story

Overview

This tutorial uses a data pass-through example to explain the data flow in the AMD Ryzen AI Phoenix NPU using the AIE dialects.

Requirements

AMD Ryzen AI Phoenix
Linux® based development environment
Python® (for test automation and result validation)
IRON API and MLIR-based AI Engine Toolchain

Project Brief

The SoC is designed to accelerate AIE-ML algorithms and deliver high performance. The NPU complex has:

16 AI Cores for computation
4 Memory Tiles for fast memory access
4 SHIM DMAs to move data

Note: This project is customized for Phoenix.

Features covered and Data Flow

One SHIM DMA pass through
Two SHIM DMA pass through
Four SHIM DMA pass through

One SHIM DMA pass through

In this section, SHIM DMA (0, 0), MEM Tile (0, 1), and Core (0, 2) of column 0 are used. A predefined data set stored in L3 memory is streamed into the NPU complex. The data is routed from the SHIM DMA to the Core through the MEM Tile memory and is then routed back. The received output stream is captured and compared with the reference.
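The path described above can be modelled on the host in plain Python. This is only a software stand-in, not IRON or MLIR-AIE code: the function names, the chunk size, and the use of Python lists as buffers are assumptions made for illustration.

```python
# Software model of the one-column pass-through path (illustrative only).
# Tile coordinates follow the (column, row) convention used in the text.
SHIM_DMA = (0, 0)
MEM_TILE = (0, 1)
CORE = (0, 2)

# Hypothetical transfer size in elements; the real design may use another.
CHUNK = 8


def chunks(data, size):
    """Yield successive fixed-size chunks of data."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def core_passthrough(chunk):
    """The core copies its input to its output unchanged."""
    return list(chunk)


def run_passthrough(l3_input):
    """Stream L3 data SHIM DMA -> MEM Tile -> Core and back, chunk by chunk."""
    l3_output = []
    for chunk in chunks(l3_input, CHUNK):
        mem_in = list(chunk)                 # SHIM DMA writes into the MEM Tile
        core_out = core_passthrough(mem_in)  # MEM Tile feeds the Core
        mem_out = list(core_out)             # Core result returns to the MEM Tile
        l3_output.extend(mem_out)            # SHIM DMA writes back to L3
    return l3_output


reference = list(range(32))
received = run_passthrough(reference)
print("PASS" if received == reference else "FAIL")
```

Because the core performs no computation, the received stream should match the input element for element, which is what makes pass-through a convenient first test of the data path.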

Two SHIM DMA pass through

In this section, SHIM DMAs (0, 0) and (1, 0), MEM Tiles (0, 1) and (1, 1), and Cores (0, 2) and (1, 2) of columns 0 and 1 are used. A predefined data set stored in L3 memory is streamed into the NPU complex. The data is routed from each SHIM DMA to the Core through the MEM Tile memory and is then routed back. The received output stream is captured and compared with the reference.

Four SHIM DMA pass through

In this section, SHIM DMAs (0, 0), (1, 0), (2, 0), and (3, 0), MEM Tiles (0, 1), (1, 1), (2, 1), and (3, 1), and Cores (0, 2), (1, 2), (2, 2), and (3, 2) of columns 0, 1, 2, and 3 are used. A predefined data set stored in L3 memory is streamed into the NPU complex. The data is routed from each SHIM DMA to the Core through the MEM Tile memory and is then routed back. The received output stream is captured and compared with the reference.
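The three configurations differ only in how many columns take part. The sketch below generates the (column, row) coordinates for any column count and models the data being split across columns. The helper names are hypothetical, and the equal contiguous split per column is an assumption: the text does not say how the input is partitioned.

```python
def tiles_for_columns(num_columns):
    """Return the SHIM DMA, MEM Tile and Core coordinates as (column, row)."""
    return {
        "shim_dma": [(col, 0) for col in range(num_columns)],
        "mem_tile": [(col, 1) for col in range(num_columns)],
        "core": [(col, 2) for col in range(num_columns)],
    }


def split_per_column(data, num_columns):
    """Split data into equal contiguous slices, one per column (assumed)."""
    if len(data) % num_columns != 0:
        raise ValueError("data length must be a multiple of num_columns")
    size = len(data) // num_columns
    return [data[i * size:(i + 1) * size] for i in range(num_columns)]


def run_multi_column(data, num_columns):
    """Pass each slice through its own column and reassemble the output."""
    slices = split_per_column(data, num_columns)
    # Each column returns its slice unchanged (pass-through).
    return [value for column_slice in slices for value in column_slice]


layout = tiles_for_columns(4)
print(layout["shim_dma"])  # [(0, 0), (1, 0), (2, 0), (3, 0)]

reference = list(range(64))
received = run_multi_column(reference, 4)
print("PASS" if received == reference else "FAIL")
```

Calling `tiles_for_columns(1)` or `tiles_for_columns(2)` gives the tile sets of the one and two SHIM DMA cases.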

Data Flow

How to Install and Setup the Environment

https://www.hackster.io/541340/amd-ryzen-ai-npu-tool-chain-installation-and-execution-b252fa

Build and Run

Navigate to one of the test case folders and build the AIE design with the make command:

env use_placed=1 make

After a successful build, compile and run the host application to execute the design on the Ryzen AI NPU:

make run NPU1=1

This invokes the MLIR-AIE runtime, which offloads the computation to the NPU, and the results are reported as soon as the run completes.
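Each pass-through test captures the output stream and compares it with the reference. A host-side check of that kind could look like the sketch below. It is not the project's actual test code: the function name and the sample buffers are hypothetical.

```python
def compare_with_reference(received, reference):
    """Return (index, received, expected) for every mismatching element."""
    if len(received) != len(reference):
        raise ValueError("received and reference lengths differ")
    return [
        (index, got, expected)
        for index, (got, expected) in enumerate(zip(received, reference))
        if got != expected
    ]


# Sample buffers standing in for the data read back from the NPU.
reference = list(range(16))
good = list(reference)
bad = list(reference)
bad[5] = 99  # inject one corrupted element

for name, buffer in (("good", good), ("bad", bad)):
    mismatches = compare_with_reference(buffer, reference)
    if mismatches:
        print(f"{name}: FAIL, {len(mismatches)} mismatch(es), first at {mismatches[0]}")
    else:
        print(f"{name}: PASS")
```

Reporting the index of the first mismatch, rather than only pass or fail, helps narrow down which transfer in the stream went wrong.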

Conclusion

These tutorials demonstrate how to use the “IRON API and MLIR-based AI Engine Toolchain” to perform a data pass-through. They will be extended to characterize the data throughput with a single SHIM DMA and with multiple SHIM DMAs operating in parallel.