Knowing What They Don't Know Could Deliver More Efficient Large Language Models
Focusing resources on the hard problems delivers measurable efficiency gains in LLMs, researchers say.
A team of researchers from the Massachusetts Institute of Technology (MIT), the MIT-IBM Watson AI Lab, and Red Hat AI Innovation has come up with a way to both improve the results from a large language model (LLM) and make it more computationally efficient: by having it focus on what it doesn't "know."
"The computational cost of inference has quickly become a major bottleneck for frontier model providers, and they are actively trying to find ways to improve computational efficiency per user queries," explains senior author Navid Azizan of the team's work. "For instance, the recent [OpenAI] GPT-5.1 release highlights the efficacy of the 'adaptive reasoning' approach our paper proposes. By endowing the models with the ability to know what they don't know, we can enable them to spend more compute on the hardest problems and most promising solution paths, and use far fewer tokens on easy ones. That makes reasoning both more reliable and far more efficient."
Large language models, the heart of the current AI boom, work by turning user input into "tokens," then finding the most statistically likely continuation tokens derived from a vast training set culled from books, web pages, videos, emails, and more. It's a computationally intensive process, and one that has driven unprecedented demand for processors and memory, tripling RAM spot prices in a matter of months and forcing hardware makers like Raspberry Pi to hike their prices.
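For readers unfamiliar with the mechanics, the toy sketch below shows the general principle of greedy next-token selection. The hand-written two-token probability table is purely illustrative and stands in for a real model's learned distribution; none of the numbers come from an actual LLM.

```python
# Toy illustration of next-token prediction: a real LLM assigns probabilities
# to every token in its vocabulary; here a tiny hand-written table stands in
# for the model. The values are invented for illustration only.
toy_model = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "quantum": 0.1},
    ("cat", "sat"): {"on": 0.7, "quietly": 0.2, "purple": 0.1},
}

def next_token(context):
    # Pick the statistically most likely continuation of the last two tokens.
    probs = toy_model.get(tuple(context[-2:]), {})
    return max(probs, key=probs.get) if probs else None

tokens = ["the", "cat"]
for _ in range(2):
    tok = next_token(tokens)
    if tok is None:
        break
    tokens.append(tok)
print(" ".join(tokens))  # prints: the cat sat on
```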
Improving the efficiency of LLMs, then, is a must, and that's where the MIT approach comes in. The team's solution is instance-adaptive scaling, which dynamically adjusts the number of candidate solutions based on a calculated likelihood of success, allocating more resources to the parts of a problem judged most difficult and pruning paths likely to waste time and effort.
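As a rough illustration of the idea, the sketch below scores partial solution paths and spends more of a fixed compute budget on the least confident ones while pruning the rest. The `score_path` and `expand_path` functions are hypothetical placeholders for a model's confidence estimate and next-step generator; this is not the team's actual implementation, which is available in their GitHub repository.

```python
# Hypothetical sketch of instance-adaptive scaling over candidate solution paths.
# score_path() and expand_path() are stand-ins for a real LLM's confidence
# estimate and continuation generator; they are not from the team's codebase.
import random

def score_path(path):
    # Placeholder: return an estimated probability that this partial solution
    # leads to a correct answer (a real system would query the model itself).
    random.seed(hash(path) % (2**32))
    return random.random()

def expand_path(path, n_children):
    # Placeholder: generate n_children candidate continuations of a partial solution.
    return [f"{path} -> step{i}" for i in range(n_children)]

def adaptive_solve(problem, budget=32, prune_below=0.2, max_depth=4):
    paths = [problem]
    for _ in range(max_depth):
        scored = [(score_path(p), p) for p in paths]
        # Prune continuations judged unlikely to succeed.
        survivors = [(s, p) for s, p in scored if s >= prune_below]
        if not survivors:
            survivors = [max(scored)]  # keep the best path even if all look weak
        next_paths = []
        for s, p in survivors:
            # Lower-confidence (harder) paths get more candidate continuations,
            # easier ones fewer -- spending compute where it is most needed.
            n_children = max(1, round((1.0 - s) * budget / len(survivors)))
            next_paths.extend(expand_path(p, n_children))
        paths = next_paths
    return max(paths, key=score_path)

print(adaptive_solve("Prove the triangle inequality"))
```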
"This is how humans solve problems," co-author Hao Wang claims of the approach. "We come up with some partial solutions and then decide, should I go further with any of these, or stop and revise, or even go back to my previous step and continue solving the problem from there?"
"The beauty of our approach is that this adaptation happens on the fly," adds research scientist Kristjan Greenewald, "as the problem is being solved, rather than happening all at once at the beginning of the process."
The team's work is being presented at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS '25) this week, with a preprint available on Cornell's arXiv server. Code supporting the paper is available on GitHub.