Sakana AI Claims Its "AI CUDA Engineer" Can Deliver 10-100× Performance Gains Over Plain PyTorch
LLM-based system converts PyTorch into CUDA kernels, then ruthlessly optimizes them — and, sometimes, cheats at its benchmarks.
UPDATE (2/24/2025): Sakana AI has issued a statement admitting that its "AI CUDA Engineer" can cheat at benchmarks, after third-party analysis failed to verify the speed-ups the company had claimed following in-house testing.
"Combining evolutionary optimization with LLMs [Large Language Models] is powerful but can also find ways to trick the verification sandbox," the company's statement reads. "For example, the system had found a memory exploit in the evaluation code which, in a number of cases, allowed it to avoid checking for correctness. Furthermore, we find the system could also find other novel exploits in the benchmark’s tasks."
"We have since made the evaluation and runtime profiling harness more robust to eliminate many of such loopholes. We are in the process of revising our paper, and our results, to reflect and discuss the effects, and mitigation of LLM reward hacking for CUDA kernel optimization. We deeply apologize for our oversight to our readers. We will provide a revision of this work soon, and discuss our learnings."
Original article continues below.
Japanese artificial intelligence startup Sakana AI claims it has developed an "AI CUDA Engineer," which can convert PyTorch workloads into CUDA kernels that run on NVIDIA's GPU hardware with a one- to two-order-of-magnitude speed-up.
"The AI CUDA Engineer [is] the first comprehensive agentic framework for fully automatic CUDA kernel discovery and optimization," the company claims of its creation. "The AI CUDA Engineer is an agentic framework that leverages frontier LLMs with the goal of automating the conversion of standard PyTorch code into highly optimized CUDA kernels. Through the use of evolutionary optimization, and leveraging concepts in evolutionary computation, such as 'crossover' operations and 'innovation archive' to discover promising 'stepping stone' kernels, our proposed framework is able to not only automate the process of converting PyTorch modules to CUDA kernels, but our highly optimized CUDA kernels often achieve speedups that have significantly faster runtime."
The company claims that the CUDA kernels generated by the "engineer" run between 10 and 100 times faster than the original PyTorch for "common PyTorch operations." For projects that already use CUDA kernels the gains are smaller, but Sakana AI still claims its software can deliver a highly optimized kernel with up to a fivefold speed increase.
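To put the claim in concrete terms, the sketch below shows what replacing a stock PyTorch operation with a hand-written CUDA kernel looks like, using PyTorch's torch.utils.cpp_extension.load_inline helper. It is an illustration of the general technique only; the kernel, its name, and the fused scale-and-add operation are invented for this example and are not output from the AI CUDA Engineer.

```python
# Illustration only: a hand-written CUDA kernel exposed to PyTorch via
# load_inline. Not code generated by Sakana AI's "AI CUDA Engineer."
import torch
from torch.utils.cpp_extension import load_inline

cuda_source = r"""
__global__ void scale_add_kernel(const float* x, const float* y,
                                 float* out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * x[i] + y[i];  // fused multiply-add per element
}

torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_add_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(),
        out.data_ptr<float>(), static_cast<float>(alpha), n);
    return out;
}
"""

# Declaration for the C++ side; load_inline generates the Python binding.
cpp_source = "torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha);"

ext = load_inline(name="scale_add_ext",
                  cpp_sources=cpp_source,
                  cuda_sources=cuda_source,
                  functions=["scale_add"])

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")
out = ext.scale_add(x, y, 2.0)            # custom kernel
assert torch.allclose(out, 2.0 * x + y)   # matches the plain PyTorch version
```

The AI CUDA Engineer's job, in effect, is to produce (and then keep improving) the kind of kernel shown in the string above, starting from nothing but the original PyTorch code.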
These performance gains come through a three-step process: in the first step the tool translates PyTorch code into working CUDA kernels; in the second step these kernels go through a "survival of the fittest" evolutionary optimization process, which includes a kernel crossover prompting strategy capable of combining separate optimized kernels to improve performance further; finally, the system uses a long-term memory dubbed the "innovation archive" to provide additional performance improvements.
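Sakana AI has not published pseudocode for that loop here, but the general shape of such an evolutionary search is straightforward to sketch. In the hypothetical Python below, the callables translate_to_cuda, mutate, crossover, is_correct, and benchmark stand in for the LLM prompting and profiling steps the company describes; none of these names comes from Sakana AI's code.

```python
# A minimal, hypothetical sketch of an evolutionary kernel-search loop.
# The helper callables are stand-ins for LLM calls and a profiling harness;
# this is not Sakana AI's implementation.
import random

def optimize_kernel(pytorch_module, translate_to_cuda, mutate, crossover,
                    is_correct, benchmark, generations=10, population_size=8):
    # Step 1: translate the PyTorch module into an initial working CUDA kernel.
    population = [translate_to_cuda(pytorch_module)]
    archive = []  # Step 3: long-term "innovation archive" of useful kernels.

    for _ in range(generations):
        # Step 2: propose new candidates by mutating survivors and by
        # "crossover" prompting that combines two promising kernels.
        candidates = list(population)
        candidates += [mutate(k) for k in population]
        if len(population) >= 2:
            a, b = random.sample(population, 2)
            candidates.append(crossover(a, b))
        if archive:
            candidates.append(crossover(random.choice(population),
                                        random.choice(archive)))

        # Keep only kernels whose output matches the PyTorch reference,
        # then rank by measured runtime: "survival of the fittest."
        valid = [k for k in candidates if is_correct(k, pytorch_module)]
        valid.sort(key=benchmark)
        population = valid[:population_size]
        archive.extend(population[:2])  # remember stepping-stone kernels

    return population[0] if population else None
```

The correctness check in that loop is also where things can go wrong, as the update above notes: if the verification harness has loopholes, the search will happily exploit them instead of producing genuinely faster kernels.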
The biggest gains come from relatively specific workloads — such as diagonal matrix multiplication operations, which run around 57 times faster — though the company claims the approach can also be used to optimize the performance of entire machine learning architectures: VanillaRNNHidden showed a 7.02× performance gain over native operation, Sakana AI claims, while the EfficientNetB2 vision architecture ran 1.24× faster and the LeNet5 vision architecture ran 1.4× faster. Some early test results had to be thrown out, however, with Sakana AI discovering the large language model (LLM)-based engineer had found ways to "cheat" the benchmarks by reusing results from an earlier PyTorch run.
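The diagonal matrix multiplication figure is a useful illustration of where order-of-magnitude gains can come from: a naive formulation materializes a full N×N matrix and performs a dense matmul, while the mathematically equivalent operation is a simple row-wise scaling. The PyTorch comparison below makes the point; it is a hand-written illustration of the idea, not the kernel the AI CUDA Engineer produced.

```python
# Hand-written illustration of why specialized diagonal-matmul code can be
# dramatically faster than the naive formulation; not Sakana AI's kernel.
import torch

N = 4096
d = torch.randn(N, device="cuda")     # the diagonal entries
B = torch.randn(N, N, device="cuda")

slow = torch.diag(d) @ B              # builds an N x N matrix, O(N^3) matmul
fast = d.unsqueeze(1) * B             # equivalent row-wise scaling, O(N^2)

# Loose tolerance to allow for reduced-precision (TF32) matmul on recent GPUs.
assert torch.allclose(slow, fast, rtol=1e-2, atol=1e-3)
```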
More information on Sakana AI's work, including a paper on the project, is available on the company's website; it has also released a dataset of more than 17,000 CUDA kernels covering a variety of PyTorch operations.