Getting LLM Efficiency Over the Hump

Camel is an edge LLM inference framework that jointly tunes GPU frequency and batch size, cutting the energy-delay product of on-device inference by up to nearly 30%.

Large Language Models (LLMs) took off in a huge way in recent years as developers of these algorithms started massively scaling up their complexity and parameter counts. At the time, there seemed to be no limits to what they could do — just add more compute for amazing results. But as parameter counts reached into the trillions, users started to experience diminishing returns. Much was also made of the tremendous amount of energy these massive models require for operation. Furthermore, being so computationally complex, the models could only run in powerful remote data centers, adding latency and privacy concerns into the mix.

Taken together, these factors have led researchers to put more effort into optimizing LLMs to be more efficient. If large parameter counts are not the magic beans we thought they were, then perhaps the same knowledge can be encoded into a smaller model. And a smaller model can offer privacy, low latency for real-time operation, and energy efficiency. These efforts have already been paying dividends as we have seen with the release of the relatively pint-sized Gemma 3 and GPT OSS models.

An overview of Camel (📷: H. Xu et al.)

But as anyone who has ever worked with any of these models can attest, they do not perform as well as the big flagship models running in the cloud. However, a group of researchers at the National University of Defense Technology has been working to get more out of these scaled-down models, which could help to make the future of edge AI brighter. In particular, they are working to optimize the trade-off between energy consumption and latency for LLM inference on edge devices.

Larger batch sizes improve efficiency by processing multiple requests simultaneously, but they also increase the time any single request waits before being addressed. Meanwhile, higher GPU frequencies reduce latency by speeding up computation, but they also draw more power. Smaller batches and lower frequencies have the opposite effects. The trade-off is not straightforward, and naïve approaches risk either draining batteries too quickly or creating unacceptable delays.
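To make that tension concrete, here is a back-of-the-envelope sketch in Python. The cost model and every constant in it are invented purely for illustration; they are not measurements from the paper or from real Jetson hardware.

```python
# Toy model of the batch-size / GPU-frequency tension. All constants and
# scaling rules below are invented for illustration only.

def batch_latency_s(batch_size: int, freq_mhz: float) -> float:
    """Time to finish one batch: fixed overhead plus per-request work,
    divided by a throughput that scales with clock frequency (toy assumption)."""
    return (16.0 + batch_size) / (0.05 * freq_mhz)

def gpu_power_w(freq_mhz: float) -> float:
    """Static power plus dynamic power that grows steeply with frequency (toy assumption)."""
    return 5.0 + 3e-8 * freq_mhz ** 3

for batch_size in (4, 28):
    for freq_mhz in (306.0, 930.75):
        delay = batch_latency_s(batch_size, freq_mhz)        # how long a request waits
        energy = gpu_power_w(freq_mhz) * delay / batch_size  # energy amortized per request
        print(f"batch={batch_size:2d}  freq={freq_mhz:6.2f} MHz  "
              f"delay={delay:5.2f} s  energy/request={energy:5.2f} J")
```

In this toy example, the big-batch, low-clock corner is the cheapest per request but the slowest, while the small-batch, high-clock corner is the fastest but the most power-hungry. That is exactly the tension Camel has to navigate.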

To address this, the team developed a framework called Camel, designed specifically to optimize both GPU frequency and batch size for edge-based LLM inference. They modeled the problem as a multi-armed bandit, a class of optimization problems that balances exploration (trying different settings) with exploitation (sticking with the best-known option). Using a Thompson Sampling approach, Camel dynamically learns the optimal configuration over time, adjusting as conditions change.
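As a rough illustration of what that looks like, the sketch below treats every (GPU frequency, batch size) pair as a bandit arm, keeps a simple Gaussian estimate of each arm's reward, and uses Thompson Sampling to pick the next configuration to try. The arm grid, the simulated reward (a noisy negative energy-delay product built on the toy cost model above), and the posterior update are all assumptions made for this sketch, not the authors' actual reward design or code.

```python
import numpy as np

# Minimal Thompson Sampling sketch over (GPU frequency, batch size) arms.
rng = np.random.default_rng(42)

frequencies_mhz = [306.0, 510.0, 714.0, 930.75]   # hypothetical candidate clocks
batch_sizes = [4, 12, 20, 28]                     # hypothetical candidate batch sizes
arms = [(f, b) for f in frequencies_mhz for b in batch_sizes]

post_mean = np.zeros(len(arms))   # posterior mean reward per arm
pulls = np.zeros(len(arms))       # how many times each arm has been tried

def observe_reward(freq_mhz, batch_size):
    """Stand-in for serving one batch and measuring energy and delay.
    On real hardware this would come from power and latency telemetry."""
    delay = (16.0 + batch_size) / (0.05 * freq_mhz)             # toy latency model
    energy = (5.0 + 3e-8 * freq_mhz ** 3) * delay / batch_size  # toy energy per request
    return -energy * delay + rng.normal(0.0, 0.1)               # noisy negative energy-delay product

for _ in range(2000):  # each round corresponds to serving one batch
    # Draw a plausible mean reward for every arm from its posterior,
    # then play the arm whose draw looks best (exploration + exploitation).
    samples = rng.normal(post_mean, 0.1 / np.sqrt(pulls + 1.0))
    arm = int(np.argmax(samples))
    reward = observe_reward(*arms[arm])
    # Incrementally update the chosen arm's posterior mean.
    pulls[arm] += 1
    post_mean[arm] += (reward - post_mean[arm]) / pulls[arm]

freq, batch = arms[int(np.argmax(post_mean))]
print(f"Configuration currently believed best: {freq} MHz, batch size {batch}")
```

The appeal of this formulation is that the system never needs a complete performance model up front; it just keeps refining its estimates as real measurements arrive.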

The trade-off between energy consumption and latency for varying batch sizes and GPU frequencies (📷: H. Xu et al.)

The framework was implemented and tested on the NVIDIA Jetson AGX Orin, a popular development board for AI at the edge. Using models such as Llama3.2-1B and Qwen2.5-3B, the researchers ran experiments across 49 configurations, varying GPU frequencies between 306 MHz and 930.75 MHz and batch sizes from 4 to 28. Results showed that Camel consistently outperformed default settings, reducing the energy-delay product by 12.4% to nearly 30%. As it turned out, the optimal configurations were not simply at the extremes of maximum speed or minimum power, but at carefully balanced midpoints.
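For context, the energy-delay product (EDP) is simply measured energy multiplied by measured delay, so scoring a configuration grid looks roughly like the sketch below. It assumes the 49 configurations form a 7 x 7 grid; the evenly spaced frequencies and the measure() stub are placeholders rather than the Jetson AGX Orin's actual clock table or real measurements from the paper.

```python
import numpy as np

# Sketch of an exhaustive 7 x 7 sweep that scores each (frequency, batch size)
# configuration by its energy-delay product: EDP = energy * delay (lower is better).

frequencies_mhz = np.linspace(306.0, 930.75, 7)   # 7 hypothetical clock steps
batch_sizes = np.arange(4, 29, 4)                 # 4, 8, ..., 28

def measure(freq_mhz, batch_size):
    """Placeholder for running a workload and logging (energy in J, delay in s).
    On real hardware these would come from the board's power rails and request
    timestamps; here they come from the invented cost model used above."""
    delay = (16.0 + batch_size) / (0.05 * freq_mhz)
    energy = (5.0 + 3e-8 * freq_mhz ** 3) * delay / batch_size
    return energy, delay

best = None
for freq in frequencies_mhz:
    for batch in batch_sizes:
        energy, delay = measure(freq, int(batch))
        edp = energy * delay
        if best is None or edp < best[0]:
            best = (edp, float(freq), int(batch))

edp, freq, batch = best
print(f"Lowest EDP in this sweep: {edp:.3f} J*s at {freq:.2f} MHz, batch size {batch}")
```

Camel's job is to land on such a midpoint without exhaustively measuring every combination ahead of time, and to keep adapting as workloads and conditions shift.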

This work demonstrates that real-world edge AI requires more than brute force — it requires intelligent tuning of parameters to meet application-specific goals. By reducing both energy consumption and latency, frameworks like Camel could make smaller, more efficient models practical for real-time use in mobile devices, wearables, and embedded systems.

nickbild
