Your Attention, Please

Infini-attention lets Transformer-based LLMs handle infinitely long inputs while keeping memory and computational requirements fixed.

Nick Bild
13 days ago • Machine Learning & AI
Infini-attention transformers maintain the entire context history (📷: T. Munkhdalai et al.)

Since their introduction, large language models (LLMs) have proven themselves to be useful for a number of tasks like natural language understanding, text generation, translation, sentiment analysis, summarization, and question answering. The Transformer-based architecture underpinning these models has been pivotal in making these cutting-edge applications a reality. However, the same attention mechanism that helps to give Transformers their powerful capabilities is also holding them back from powering the next generation of machine learning applications, especially where a lot of context is required.

The attention mechanism plays a crucial role in Transformers by allowing them to focus on relevant parts of the input sequence during processing. However, this mechanism exhibits quadratic complexity in both memory utilization and computation time. This complexity arises from the need to compute attention scores between all pairs of positions in the input sequence, resulting in significant resource requirements. The attention Key-Value (KV) states of a 500 billion parameter LLM serving a batch of 512 requests at a context length of 2,048, for example, require a whopping three terabytes of memory.
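To get a feel for the scale involved, here is a rough, back-of-the-envelope estimate in Python. The layer count, model width, and batch size below are hypothetical round numbers rather than the configuration of any published model, but they show why the KV cache reaches into the terabytes even at a modest context length, and why the score computation itself scales quadratically.

```python
# Back-of-the-envelope estimate of the attention KV-cache footprint.
def kv_cache_bytes(num_layers, d_model, context_len, batch_size, bytes_per_value=2):
    """Cached Key and Value states: two tensors per layer, each of shape
    (batch, context, d_model), stored in 16-bit precision."""
    return 2 * num_layers * d_model * context_len * batch_size * bytes_per_value

# Hypothetical round-number configuration for a very large model (an
# assumption, not any specific 500B-parameter architecture).
size = kv_cache_bytes(num_layers=128, d_model=16_384, context_len=2_048, batch_size=512)
print(f"KV cache: {size / 1e12:.1f} TB")  # several terabytes at only 2,048 tokens

# The cache grows linearly with context length, while the attention-score
# matrix itself grows with context_len ** 2 per head: doubling the context
# quadruples that computation.
```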

Should far more context be needed, perhaps to summarize a full-length book, the memory requirements quickly become unmanageable. To overcome this issue, compressive memory systems, in which a fixed number of parameters are used regardless of the input size, have been proposed. Unfortunately, no practical and effective compressive memory systems have been integrated into LLMs to date.
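The core idea can be sketched in a few lines. In the toy Python snippet below, an associative matrix of fixed size absorbs key-value pairs from inputs of any length, so its footprint never changes no matter how much text has been read. The dimensions and the simple outer-product update rule are illustrative assumptions rather than any particular system's design.

```python
import numpy as np

# Toy illustration of a compressive memory: a fixed-size associative matrix
# absorbs key-value pairs from inputs of any length.
d_key, d_value = 64, 64
memory = np.zeros((d_key, d_value))

for num_tokens in (1_000, 10_000, 100_000):
    keys = np.random.randn(num_tokens, d_key)
    values = np.random.randn(num_tokens, d_value)
    memory += keys.T @ values              # fold everything into the same matrix
    print(num_tokens, memory.nbytes)       # footprint stays at 64 * 64 * 8 bytes
```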

New research conducted by a team of engineers at Google is seeking to change that, however. They have developed a novel approach called Infini-attention that enables Transformer-based LLMs to process infinitely long input sequences. And no matter the size of the input, the memory and computational requirements are fixed, making Infini-attention practical and efficient for use cases requiring any amount of context.

Infini-attention incorporates a compressive memory into the normal attention mechanism used by an LLM. It also integrates both masked local attention and long-term linear attention mechanisms within a single Transformer block. This combination enables the model to effectively capture both short- and long-range dependencies in the input sequence.
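Here is a rough sketch of how such a block might walk through a long input, one segment at a time: masked dot-product attention handles the current segment, while a fixed-size memory matrix supplies long-range context and is then updated with the segment's key-value states. This is single-head toy code with projections, feature maps, and training details simplified, so treat the shapes and the ReLU-style feature map as assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

# Single-head sketch of segment-wise processing with local attention plus a
# fixed-size long-term memory. Shapes, the feature map, and the plain
# additive combination are simplifying assumptions for illustration.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_local_attention(Q, K, V):
    """Standard masked dot-product attention within one segment."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ V

d, seg_len, n_segments = 64, 128, 8
M = np.zeros((d, d))    # compressive memory: same size for any input length
z = np.zeros(d)         # running normalization term
x = np.random.randn(n_segments * seg_len, d)   # stand-in for projected inputs

outputs = []
for s in range(n_segments):
    Q = K = V = x[s * seg_len:(s + 1) * seg_len]   # projections omitted
    local = causal_local_attention(Q, K, V)        # short-range dependencies
    sq = np.maximum(Q, 0) + 1.0                    # non-negative feature map
    from_memory = (sq @ M) / (sq @ z + 1e-6)[:, None]   # long-range retrieval
    outputs.append(local + from_memory)   # simple addition here; a gated
                                          # blend is sketched further below
    sk = np.maximum(K, 0) + 1.0
    M += sk.T @ V                         # fold this segment's KV states in
    z += sk.sum(axis=0)

y = np.vstack(outputs)                    # full output, built segment by segment
```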

Unlike the standard attention mechanism, which typically discards old KV states after computation, Infini-attention retains these states in the compressive memory. This enables long-term memory consolidation and retrieval, ensuring that past information is not lost and can be utilized for subsequent sequences. Finally, the system aggregates the retrieved values from long-term memory with the local attention contexts to compute the final contextual output. This integration ensures that both short- and long-range dependencies are appropriately considered in the output generation process.
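According to the paper's description, that aggregation is handled by a learned gate that blends the memory-retrieved values with the local attention output. A minimal sketch of the idea, using random stand-in tensors and a single scalar gate as assumptions:

```python
import numpy as np

# Sketch of the final aggregation: a learned scalar gate blends values
# retrieved from long-term memory with the local attention output.
# Toy shapes and random inputs are assumptions for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

seg_len, d = 128, 64
local_out = np.random.randn(seg_len, d)    # masked local attention result
memory_out = np.random.randn(seg_len, d)   # values retrieved from memory
beta = 0.0                                 # learned gating parameter

gate = sigmoid(beta)                       # 0 -> favor local, 1 -> favor memory
combined = gate * memory_out + (1.0 - gate) * local_out
print(combined.shape)                      # (128, 64), same as standard attention output
```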

The team conducted some experiments to demonstrate the utility of their approach. In these trials, it was found that Infini-attention surpasses baseline models in tasks related to long-context language modeling. A significant improvement in memory efficiency was also observed, with a memory compression ratio 114 times higher than that of the baseline models. This means that the proposed method achieves better performance while requiring much less memory, which is crucial for scalability and resource efficiency. In another test, a relatively small 8 billion parameter model equipped with Infini-attention achieved a new state-of-the-art result in a book summarization task involving sequences of 500,000 input tokens.

Simply scaling up the size of models and the hardware resources that they use is quickly growing impractical. Such methods will prove to be unsustainable as the next generation of AI tools emerge. But with approaches like Infini-attention on the horizon, the future of AI is looking much brighter.
