This session demonstrates the multidimensional DMA capabilities of the AIE shim DMA on the AMD Ryzen AI Phoenix, using AIE dialects and the AIE API.
Requirements
- AMD Ryzen AI Phoenix
- Linux® based development environment
- Python® (for test automation and result validation)
- IRON API and MLIR-based AI Engine Toolchain
Project Brief
The SoC is designed to accelerate AIE-ML algorithms, delivering high performance. The NPU complex has:
- 16 AIE cores for computation
- 4 memory tiles for fast memory access
- 4 shim DMAs to move data in and out of L3 memory
Note: This project is customized for Phoenix.
Features covered in this session
How to configure the shim DMA of the AMD Ryzen AI NPU.
Architecture
Multidimensional Direct Memory Access (DMA) is highly useful in modern computing—particularly for AI, image processing, and high-performance computing (HPC)—because it allows the hardware to efficiently move non-contiguous, multidimensional data blocks (such as matrices or 2D images) without constant CPU intervention. While standard DMA handles contiguous data, multidimensional DMA understands strides and data layout, allowing it to extract, reorder, or pack complex data structures directly between memory and processing units.
1. Significant Reduction in CPU Overhead
In typical applications, the CPU spends considerable time calculating addresses for scattered data or copying data piece by piece. Multidimensional DMA offloads this work, allowing the CPU to perform computations (like matrix multiplication) while the DMA engine handles data movement in the background.
2. Efficient Handling of Non-Contiguous Data
Often, required data (e.g., a sub-image or a sub-matrix) is scattered in memory, rather than being in a single contiguous block.
- Strided Transfers: Multidimensional DMA can access data using strides, allowing it to read a column or a specific patch of an image from a row-major memory structure in a single operation.
- Scatter-Gather: It can collect scattered fragments of data from memory and assemble them into a contiguous block for the accelerator, or vice-versa.
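As a software sketch of what a strided transfer achieves (this models the addressing pattern in NumPy, not the DMA hardware or the IRON API), the example below reads one column of a row-major matrix in a single strided pass:

```python
import numpy as np

# A row-major 4x4 matrix laid out in linear memory as values 0..15.
flat = np.arange(16, dtype=np.int32)

# A strided "transfer": start at offset 2 (column index), then step by
# the row width. This gathers column 2 without copying the whole matrix.
row_width = 4
col = flat[2::row_width]   # column values: 2, 6, 10, 14
```

A hardware strided DMA does the equivalent address arithmetic in the engine itself, so the CPU never touches the intermediate addresses.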
3. Optimization for AI and Image Processing
- Image Handling: In video or image processing, 2D blocks (pixels, patches) are moved without needing to copy the entire image, reducing unnecessary bus traffic.
- Matrix Multiplication (GEMM): For AI workloads, it enables efficient loading of "tiles" of matrices into tightly coupled memory (like Shared Memory in GPUs), which is critical for high-performance computing.
4. Reduced Memory Latency
By moving data directly between global memory and local, high-speed, on-chip scratchpad memory, it keeps processor cores fed with data, bypassing the higher latency of main memory or caches. This is essential for preventing stalls in high-throughput applications like neural network inference.
5. Advanced Data Manipulation
Multidimensional DMA often supports advanced features beyond simple movement:
- Transpose: Transposing a matrix during the copy process to align with the expected input format of a compute unit.
- Padding: Automatically adding padding values to the edges of a 2D block for alignment.
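The two features above can be sketched in NumPy (a software analogy only; the exact capabilities available in a given DMA engine depend on the hardware generation):

```python
import numpy as np

# A 2x3 tile: [[0, 1, 2], [3, 4, 5]]
block = np.arange(6, dtype=np.int32).reshape(2, 3)

# Transpose expressed purely through stride reinterpretation,
# analogous to reordering during the copy rather than after it.
transposed = block.T                      # shape (3, 2)

# Zero-pad the tile from width 3 to width 4, e.g. for alignment
# with a compute unit's expected input width.
padded = np.pad(block, ((0, 0), (0, 1)))  # shape (2, 4)
```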
The AIE array has a shim row, which interfaces with the Thin-NoC and the PL. The shim DMA has multidimensional addressing capabilities of up to 3 dimensions.
Sample use case
Extraction of an 8x8 sub-block from a larger 64x16 block
- M: number of rows
- K: number of columns
- Ks: number of columns in the sub-block
The MM2S (source) DMA is configured for a multidimensional read from L3 (DDR), and the S2MM DMA is configured for a 1-D write to L3 (DDR).
SHIM multidimensional DMA configuration:
Sizes: [1, K, M, Ks]
Strides: [1, Ks, K, 1]
Below is a simplified example that extracts a 3x2 sub-block from a 4x4 matrix.
Config = Sizes: [1, 4, 3, 2], Strides: [1, 2, 4, 1]
The DMA processes these from right to left:
- Read row (size 2, stride 1): grab 2 elements from the row, stepping by 1.
- Next row (size 3, stride 4): jump 4 positions (the full matrix width) to start the next row; do this 3 times.
- Next block (size 4, stride 2): once a 3x2 block is done, move the starting pointer by 2 to grab the next tile; do this 4 times.
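The address sequence the DMA walks can be modeled with nested loops, leftmost size/stride pair outermost, rightmost innermost. The sketch below (plain Python, not the IRON API) generates the addresses for the 3x2-from-4x4 example:

```python
# Software model of the shim DMA's 4-entry size/stride address generator.
# sizes/strides are written as in the text: leftmost = outermost loop,
# rightmost = innermost loop (the dimension processed first).
def dma_addresses(sizes, strides):
    addrs = []
    for i0 in range(sizes[0]):
        for i1 in range(sizes[1]):
            for i2 in range(sizes[2]):
                for i3 in range(sizes[3]):
                    addrs.append(i0 * strides[0] + i1 * strides[1]
                                 + i2 * strides[2] + i3 * strides[3])
    return addrs

# Config for the example: Sizes [1, 4, 3, 2], Strides [1, 2, 4, 1]
addrs = dma_addresses([1, 4, 3, 2], [1, 2, 4, 1])
print(addrs[:6])   # first 3x2 tile: [0, 1, 4, 5, 8, 9]
```

The first six addresses (0, 1, 4, 5, 8, 9) are rows 0-2, columns 0-1 of the row-major 4x4 matrix, i.e. the first 3x2 sub-block; the outer size-4/stride-2 loop then shifts the start pointer by 2 for each subsequent tile.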
Data flow of sub-block extraction
To generate a data visualization of the transpose (like that above), run:
make generate_access_map
To compile the placed design:
env use_placed=1 make
To run the design:
make run
Input Data:
A linear sequence from 1 to 1024 (int32).
Output Data:
https://www.hackster.io/541340/amd-ryzen-ai-npu-tool-chain-installation-and-execution-b252fa
Conclusion
This session demonstrates how to use the IRON API and MLIR-based AI Engine Toolchain, together with the AIE API, to configure a shim DMA for a multidimensional use case.