Accelerating AI Efficiency
Meta AI's custom silicon may increase the performance, scalability, and efficiency of inferencing in future AI-based applications.
When it comes to AI hardware, most people tend to think of the resources needed to train a large model. And perhaps this is unsurprising, considering that the quantity of GPUs and memory needed to train a model with billions of parameters on billions, or even trillions, of data samples can be truly staggering. But this is only the beginning for a machine learning algorithm. These algorithms will spend most of their lifetime running inferences, long after the training process has been completed.
Considering that models do all of their useful work when they are running inferences, it makes sense that special attention should be given to this side of the equation as well. It may end up being the case that the same hardware that was used to train the model is not the optimal choice for running inferences.
Engineers at Meta AI have been working to understand how they can better optimize inference for a number of their specific use cases, including content understanding, Feeds, generative AI, and ads ranking. They identified that deep learning recommendation models are especially important to the services and applications that they provide, so they explored ways to run those models more efficiently.
Notably, they found that GPUs were not the optimal solution for running their recommendation workloads efficiently. To solve this problem, they designed their own application-specific integrated circuit called the Meta Training and Inference Accelerator (MTIA). Support for the chip was integrated into PyTorch to make development for the platform as simple as possible.
After the design was complete, MTIA was fabricated with Taiwan Semiconductor Manufacturing Company’s 7 nanometer process. The accelerator consists of a grid of 64 processing elements (PEs) running at 800 MHz. Up to 128 GB of off-chip LPDDR5 DRAM can be accessed, and 128 MB of on-chip SRAM is available to the PEs for lower latency access to frequently used data and instructions. Each PE has a pair of general-purpose RISC-V processor cores, along with a number of purpose-built units that perform specialized, critical functions, like matrix multiplications. Each PE also has 128 KB of local SRAM for storing data that is actively being operated on.
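To put the figures above in proportion, the short Python sketch below derives a few totals from them. It is only arithmetic on the numbers quoted in this article, not a description of how the chip is programmed. It assumes that the 128 KB of local SRAM is per PE and that sizes use binary units (1 MB = 1024 KB). The constant and function names are made up for illustration.

```python
# Figures quoted in the article.
NUM_PES = 64                 # processing elements in the grid
CORES_PER_PE = 2             # RISC-V cores per PE
LOCAL_SRAM_PER_PE_KB = 128   # ASSUMPTION: the 128 KB figure is per PE
ON_CHIP_SRAM_MB = 128        # SRAM shared by the PEs
OFF_CHIP_DRAM_GB = 128       # maximum LPDDR5 capacity
CLOCK_MHZ = 800


def summarize_mtia() -> dict:
    """Derive a few totals from the quoted specifications."""
    return {
        # Total general-purpose cores across the grid.
        "riscv_cores": NUM_PES * CORES_PER_PE,
        # Local SRAM summed over every PE, in MB (binary units assumed).
        "total_local_sram_mb": NUM_PES * LOCAL_SRAM_PER_PE_KB / 1024,
        # How much larger the off-chip DRAM is than the on-chip SRAM.
        "dram_to_sram_ratio": OFF_CHIP_DRAM_GB * 1024 / ON_CHIP_SRAM_MB,
        # Duration of one clock cycle in nanoseconds.
        "cycle_time_ns": 1000 / CLOCK_MHZ,
    }


if __name__ == "__main__":
    for name, value in summarize_mtia().items():
        print(f"{name}: {value}")
```

Under those assumptions, the grid holds 128 RISC-V cores and 8 MB of PE-local SRAM in total, and the maximum off-chip memory is 1,024 times larger than the on-chip SRAM, which illustrates why keeping frequently used data on-chip matters for latency.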
The system is governed by a dedicated control subsystem that runs the system’s firmware. This firmware manages all of the compute and memory resources, communicates with the host through a dedicated interface, and manages the execution of jobs on the MTIA.
MTIAs were built into dual M.2 boards, which provide the interconnects and supporting hardware to interface them with a host system via PCIe 4.0 x8 links. As implemented by Meta AI, each host system contains twelve MTIA boards, which are connected directly to the host CPU. With the help of a hierarchy of PCIe switches, the MTIAs can also communicate directly with one another, without going through the host CPU. This allows for heavy parallelization with minimal overhead, which improves both performance and energy efficiency.
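For a rough sense of what a PCIe 4.0 x8 link can carry, the Python sketch below estimates the theoretical peak bandwidth per link and per host. It uses the standard PCIe 4.0 signalling rate of 16 GT/s per lane with 128b/130b encoding. It assumes one x8 link per board with all twelve links active at once, and it ignores packet overhead and the switch topology, so real throughput will be lower. The function names are hypothetical.

```python
# Theoretical PCIe bandwidth estimate (an upper bound, not a measurement).
PCIE4_GT_PER_S_PER_LANE = 16      # PCIe 4.0 signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding
LANES_PER_LINK = 8                # x8 link, as stated in the article
BOARDS_PER_HOST = 12              # as stated in the article


def link_bandwidth_gb_per_s(lanes: int = LANES_PER_LINK) -> float:
    """Peak one-direction bandwidth of a PCIe 4.0 link in GB/s."""
    bits_per_s = PCIE4_GT_PER_S_PER_LANE * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e9   # bits -> bytes -> gigabytes


def host_aggregate_gb_per_s(boards: int = BOARDS_PER_HOST) -> float:
    """Sum over all boards. ASSUMPTION: one x8 link per board, all busy."""
    return boards * link_bandwidth_gb_per_s()


if __name__ == "__main__":
    print(f"per x8 link:  {link_bandwidth_gb_per_s():.2f} GB/s")
    print(f"per host (12): {host_aggregate_gb_per_s():.1f} GB/s")
```

This works out to roughly 15.75 GB/s in each direction per link, and about 189 GB/s in aggregate if all twelve links were saturated simultaneously.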
To understand how the finished MTIA chip stacked up against other AI accelerators, especially with respect to efficiency, and by extension, total cost of ownership, the team ran some experiments. The MTIA was compared with an NNPI accelerator and a GPU while running inferences against deep learning recommendation models of varying sizes and complexities. For lower-complexity models, the MTIA handled small shapes and batch sizes more efficiently. But when the higher-complexity models were evaluated, the GPU was found to be a better solution. The engineers explained that this is likely because they are still optimizing their software stack, and they expect this result to shift more in MTIA’s favor with a bit more work.
Meta AI is using the lessons learned from producing its own custom silicon to develop better architectures and software stacks for future hardware. The team believes that continuing down this path will lead to improved performance and scaling of future systems.