This session demonstrates the multidimensional DMA capabilities of the AIE shim DMA on the AMD Ryzen AI Phoenix, using AIE dialects and the AIE API.
Requirements
- AMD Ryzen AI Phoenix
- Linux® based development environment
- Python® (for test automation and result validation)
- IRON API and MLIR-based AI Engine Toolchain
Project Brief
The SoC is designed to accelerate AIE-ML algorithms, delivering high performance. The NPU complex has:
- 16 AIE cores for computation
- 4 memory tiles for fast memory access
- 4 shim DMAs to move data in and out of L3 memory
Note: This project is customized for Phoenix.
Features covered in this session
How to configure the shim DMA of the AMD Ryzen AI NPU.
Architecture
Multidimensional Direct Memory Access (DMA) is highly useful in modern computing—particularly for AI, image processing, and high-performance computing (HPC)—because it allows the hardware to efficiently move non-contiguous, multidimensional data blocks (such as matrices or 2D images) without constant CPU intervention. While standard DMA handles contiguous data, multidimensional DMA understands strides and data layout, allowing it to extract, reorder, or pack complex data structures directly between memory and processing units.
1. Significant Reduction in CPU Overhead
In typical applications, the CPU spends considerable time calculating addresses for scattered data or copying data piece by piece. Multidimensional DMA offloads this work, allowing the CPU to perform computations (like matrix multiplication) while the DMA engine handles data movement in the background.
2. Efficient Handling of Non-Contiguous Data
Often, required data (e.g., a sub-image or a sub-matrix) is scattered in memory, rather than being in a single contiguous block.
- Strided Transfers: Multidimensional DMA can access data using strides, allowing it to read a column or a specific patch of an image from a row-major memory structure in a single operation.
- Scatter-Gather: It can collect scattered fragments of data from memory and assemble them into a contiguous block for the accelerator, or vice-versa.
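As a software sketch of what a strided transfer achieves (this models the addressing pattern in NumPy, not the DMA hardware or the IRON API), the example below reads one column of a row-major matrix in a single strided pass:

```python
import numpy as np

# A row-major 4x4 matrix laid out in linear memory as values 0..15.
flat = np.arange(16, dtype=np.int32)

# A strided "transfer": start at offset 2 (column index), then step by
# the row width. This gathers column 2 without copying the whole matrix.
row_width = 4
col = flat[2::row_width]   # column values: 2, 6, 10, 14
```

A hardware strided DMA does the equivalent address arithmetic in the engine itself, so the CPU never touches the intermediate addresses.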
3. Optimization for AI and Image Processing
- Image Handling: In video or image processing, 2D blocks (pixels, patches) are moved without needing to copy the entire image, reducing unnecessary bus traffic.
- Matrix Multiplication (GEMM): For AI workloads, it enables efficient loading of "tiles" of matrices into tightly coupled memory (like Shared Memory in GPUs), which is critical for high-performance computing.
4. Reduced Memory Latency
By moving data directly between global memory and local, high-speed, on-chip scratchpad memory, it keeps processor cores fed with data, bypassing the higher latency of main memory or caches. This is essential for preventing stalls in high-throughput applications like neural network inference.
5. Advanced Data Manipulation
Multidimensional DMA often supports advanced features beyond simple movement:
- Transpose: Transposing a matrix during the copy process to align with the expected input format of a compute unit.
- Padding: Automatically adding padding values to the edges of a 2D block for alignment.
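The two features above can be sketched in NumPy (a software analogy only; the exact capabilities available in a given DMA engine depend on the hardware generation):

```python
import numpy as np

# A 2x3 tile: [[0, 1, 2], [3, 4, 5]]
block = np.arange(6, dtype=np.int32).reshape(2, 3)

# Transpose expressed purely through stride reinterpretation,
# analogous to reordering during the copy rather than after it.
transposed = block.T                      # shape (3, 2)

# Zero-pad the tile from width 3 to width 4, e.g. for alignment
# with a compute unit's expected input width.
padded = np.pad(block, ((0, 0), (0, 1)))  # shape (2, 4)
```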
The AIE array has a shim row, which interfaces with the Thin-NoC and the PL. The shim DMA has multidimensional addressing capabilities of up to 3 dimensions.
Sample use case
Extraction of an 8x8 sub-block from a larger 64x16 block
- M: number of rows
- K: number of columns
- Ks: number of columns in the sub-block
The MM2S (source) DMA is configured for a multidimensional read from L3 (DDR), and the S2MM DMA is configured for a 1-D write to L3 (DDR).
SHIM multidimensional DMA configuration:
Sizes: [1, K, M, Ks]
Strides: [1, Ks, K, 1]
Below is a simplified example that extracts a 3x2 sub-block from a 4x4 matrix.
Config = Sizes: [1, 4, 3, 2], Strides: [1, 2, 4, 1]
The DMA processes these from right to left:
- Read row (size 2, stride 1): grab 2 elements from the row, stepping by 1.
- Next row (size 3, stride 4): jump 4 positions (the full matrix width) to start the next row; do this 3 times.
- Next block (size 4, stride 2): once a 3x2 block is done, move the starting pointer by 2 to grab the next tile; do this 4 times.
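The address sequence the DMA walks can be modeled with nested loops, leftmost size/stride pair outermost, rightmost innermost. The sketch below (plain Python, not the IRON API) generates the addresses for the 3x2-from-4x4 example:

```python
# Software model of the shim DMA's 4-entry size/stride address generator.
# sizes/strides are written as in the text: leftmost = outermost loop,
# rightmost = innermost loop (the dimension processed first).
def dma_addresses(sizes, strides):
    addrs = []
    for i0 in range(sizes[0]):
        for i1 in range(sizes[1]):
            for i2 in range(sizes[2]):
                for i3 in range(sizes[3]):
                    addrs.append(i0 * strides[0] + i1 * strides[1]
                                 + i2 * strides[2] + i3 * strides[3])
    return addrs

# Config for the example: Sizes [1, 4, 3, 2], Strides [1, 2, 4, 1]
addrs = dma_addresses([1, 4, 3, 2], [1, 2, 4, 1])
print(addrs[:6])   # first 3x2 tile: [0, 1, 4, 5, 8, 9]
```

The first six addresses (0, 1, 4, 5, 8, 9) are rows 0-2, columns 0-1 of the row-major 4x4 matrix, i.e. the first 3x2 sub-block; the outer size-4/stride-2 loop then shifts the start pointer by 2 for each subsequent tile.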
Data flow of sub-block extraction
To generate a data visualization of the transpose (like that above), run:
make generate_access_map
To compile the placed design:
env use_placed=1 make
To run the design:
make run
Input Data:
A linear sequence from 1 to 1024 (int32).
Output Data:
https://www.hackster.io/541340/amd-ryzen-ai-npu-tool-chain-installation-and-execution-b252fa
Conclusion
This session demonstrates how to use the IRON API and MLIR-based AI Engine Toolchain, together with the AIE API, to configure a shim DMA for a multidimensional use case.