NVIDIA Makes GPU Acceleration Easier, More Portable with CUDA Tiles, cuTile Python
New tile-based programming paradigm aims to make GPU-accelerated code easier to write and more portable.
NVIDIA has announced the launch of CUDA 13.1, the latest release in its general-purpose graphics processing unit (GPGPU) offload language family — and it brings something new, designed so that developers no longer need to worry about which hardware to target: CUDA Tiles.
"CUDA exposes a single-instruction, multiple-thread (SIMT) hardware and programming model for developers. This requires (and enables) you to exhibit fine-grained control over how your code is executed with maximum flexibility and specificity. However, it can also require considerable effort to write code that performs well, especially across multiple GPU architectures," NVIDIA's Jonathan Bentz and Tony Scudiero explain.
"With more complex hardware, more software is needed to help harness these capabilities. CUDA Tile abstracts away tensor cores and their programming models so that code using CUDA Tile is compatible with current and future tensor core architectures."
Using the new tile-based programming paradigm, developers specify "tiles" of data and define the computational operations to be performed on those tiles — which, the company says, provides a higher-level approach than the traditional block-level task-parallel programming paradigm, in which an application maps data onto both blocks and individual threads.
In tile programming, by contrast, the application maps data onto blocks only, and the compiler maps those blocks onto threads. "Under the covers," Bentz and Scudiero claim, "the right things happen, and your computations continue completely transparent to you."
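The distinction between the two mapping styles can be sketched on the CPU with NumPy. The snippet below is purely illustrative — it is not the cuTile API, and the tile width is an arbitrary choice — but it shows the shift from indexing every element explicitly (as a SIMT thread would) to expressing one operation per whole tile and leaving the element-level mapping to lower layers:

```python
# Illustrative CPU-only analogy of the two mapping styles; NOT the cuTile API.
import numpy as np

data = np.arange(32, dtype=np.float32)
TILE = 8  # tile width (hypothetical choice for illustration)

# SIMT-style: the programmer indexes every element explicitly,
# as if each inner loop iteration were an individual GPU thread.
simt_result = np.empty_like(data)
for block in range(len(data) // TILE):   # "block" index
    for thread in range(TILE):           # "thread" index
        i = block * TILE + thread
        simt_result[i] = data[i] * 2.0

# Tile-style: the programmer expresses one operation per whole tile;
# how the tile maps onto threads is left to the compiler/runtime.
tile_result = np.empty_like(data)
for block in range(len(data) // TILE):
    tile = data[block * TILE:(block + 1) * TILE]           # a "tile" of data
    tile_result[block * TILE:(block + 1) * TILE] = tile * 2.0  # whole-tile op

assert np.array_equal(simt_result, tile_result)
```

Both loops compute the same result; the tile version simply stops micromanaging the per-element ("per-thread") indexing.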
The new approach, dubbed CUDA Tile Intermediate Representation (CUDA Tile IR), is inspired by the way NumPy and other Python libraries handle workloads — and, in fact, NVIDIA has announced cuTile Python, an expression of the tile-based programming model in Python. "cuTile automates block-level parallelism and asynchrony, memory movement, and other low-level details of GPU programming," Bentz and Scudiero say.
"It will leverage the advanced capabilities of NVIDIA hardware (such as tensor cores, shared memory, and tensor memory accelerators) without requiring explicit programming. cuTile is portable across different NVIDIA GPU architectures, enabling you to use the latest hardware features without rewriting your code."
cuTile Python is now available on GitHub under the permissive Apache 2.0 license. It requires an NVIDIA GPU with compute capability 10.x or 12.x, driver version R580 or later, CUDA Toolkit 13.1 or later, and Python 3.10 or later. Separately, CUDA Tile IR allows others to build their own domain-specific language compiler or library, with more information available in NVIDIA's documentation.