DeepSeek's Latest LLM Pays Less Attention, So You Can Talk to It for Longer

"Sparse attention" feature trims unnecessary tokens in order to avoid overflowing the context window.

Chinese artificial intelligence firm DeepSeek has released its latest large language model, DeepSeek-V3.2-Exp — an "experimental version" that the company says includes a "sparse attention" mechanism, which can improve performance when given long inputs.

"We are excited to announce the official release of DeepSeek-V3.2-Exp, an experimental version of our model," the company says of its latest release. "As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention — a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios. This experimental release represents our ongoing research into more efficient transformer architectures, particularly focusing on improving computational efficiency when processing extended text sequences."

Large language models are the lifeblood of the current "artificial intelligence" boom, despite being entirely devoid of intelligence themselves. Trained on vast, and typically ill-gotten, troves of data with no regard for copyright or permission, they transform user-provided inputs into a stream of "tokens" and then return the most statistically likely tokens required to continue the stream — which, if you're asking a question, means a response that looks like an answer and, if you're lucky, may even be usable in place of the real thing.
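As a rough illustration of that token-juggling — not DeepSeek's code, just a toy Python sketch in which a hand-made probability table stands in for a trained model — next-token generation boils down to something like this:

```python
# Toy next-token generation: a hypothetical lookup table of continuation
# probabilities stands in for a trained model (this is not DeepSeek's code).
import random

NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "answer": 0.2},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"barked": 0.7, "slept": 0.3},
    "sat": {"down": 1.0},
}

def tokenize(text):
    # Real tokenizers split text into subword units; whitespace is a stand-in.
    return text.lower().split()

def generate(prompt, max_new_tokens=3):
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not probs:
            break
        # Pick a continuation weighted by its (made-up) probability.
        choices, weights = zip(*probs.items())
        tokens.append(random.choices(choices, weights=weights)[0])
    return " ".join(tokens)

print(generate("The cat"))  # e.g. "the cat sat down"
```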

DeepSeek shot to fame with the release of its first open model, DeepSeek-R1, thanks to claims that it had been trained to be on par with equivalent proprietary models from the likes of OpenAI and Meta on a "mere" $10 million budget — though critics were quick to point out that it was standing on the shoulders of giants, particularly with its smaller "distilled" variants, which were openly based on existing Qwen and Meta Llama models.

Like all LLMs, though, DeepSeek-R1 had its limitations — even putting aside the fundamental problem of LLMs being unable to "understand" in any meaningful way, which leads to responses known as "hallucinations" that are entirely divorced from reality. A key issue is the size of the "context window," the number of tokens the model can keep in memory at any given time. When a long enough conversation — or one with a large number of inputs, such as an ill-considered request to summarize lengthy documents, an oft-cited LLM-friendly task that can have dire results if the output isn't carefully compared to a real summary from someone who has actually read the document in question — creates a stream longer than the context window, the response becomes increasingly likely to be off-kilter and counterfactual.
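To picture the failure mode, here's a deliberately tiny Python sketch — the eight-token window and crude truncation are illustrative assumptions, not how DeepSeek or any real model manages its context — showing how the oldest tokens simply fall out of view once the stream outgrows the window:

```python
# Illustrative context window: real models hold tens or hundreds of thousands
# of tokens, but the failure mode is the same — the oldest tokens drop out.
CONTEXT_WINDOW = 8  # made-up size for demonstration purposes

def visible_context(token_stream):
    # The model only ever "sees" the most recent CONTEXT_WINDOW tokens.
    return token_stream[-CONTEXT_WINDOW:]

conversation = ("please summarize this very long document about antique "
                "steam engines and their boiler maintenance schedules").split()
print(visible_context(conversation))
# ['antique', 'steam', 'engines', 'and', 'their', 'boiler', 'maintenance', 'schedules']
# The original request ("please summarize...") has already fallen out of view.
```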

As a Band-Aid over this problem, DeepSeek-V3.2-Exp includes the company's implementation of a "sparse attention" system, dubbed DeepSeek Sparse Attention (DSA). Described as a "prototype," it is designed to prune tokens in such a way as to maximize the useful context provided to the model while minimizing the overall length of the token stream — avoiding context window overflow. In many benchmarks, particularly those that involve tool use to create an "agentic" model capable of taking action on its user's behalf, this delivers a small performance gain; in others, including benchmarks that do not exceed the context window and thus could not be expected to benefit from sparse attention, it delivers an equally small performance loss.
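The sketch below shows the general top-k flavor of sparse attention in NumPy — each query keeps only a handful of its highest-scoring keys rather than attending to everything. It is purely illustrative: it still computes every score, so it captures the selection idea rather than the efficiency win, and it is not DeepSeek's actual DSA mechanism.

```python
# Generic top-k sparse attention in NumPy: each query attends over only its
# k highest-scoring keys. Purely illustrative — it still computes every score,
# and it is not DeepSeek's DSA implementation.
import numpy as np

def topk_sparse_attention(q, k, v, top_k=4):
    """q, k, v: (n, d) arrays; returns (n, d) attended outputs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n, n) relevance scores
    if top_k < scores.shape[-1]:
        # Per query, find the top_k-th largest score and drop everything below it.
        threshold = np.sort(scores, axis=-1)[:, -top_k][:, None]
        scores = np.where(scores >= threshold, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the kept keys
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))                # 16 tokens, 8 dimensions
print(topk_sparse_attention(q, k, v, top_k=4).shape)     # (16, 8)
```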

More information is available in the project's GitHub repository, along with links to demos and kernels; the model weights, along with the contents of the repository itself, have been released under the permissive MIT license. Additional information is available on Hugging Face.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.