No MoE Compromises
A new mixture-of-experts framework balances latency, energy, and performance to allow LLMs to be deployed on edge hardware.
Rapid advances in artificial intelligence (AI) have been achieved in recent years through the development of breakthrough algorithms, improvements in AI-centric hardware, and a number of other factors. But one of the largest contributors to the success of these tools is also the least impressive from a technological standpoint. Some of the best models are the best — at least in part — because they are larger than the rest. 100 billion parameters works well? Great! Let’s try a trillion. Massive datasets enhance accuracy? Alright, then we’ll train our model on the entire text of the public internet.
This is hardly an elegant approach, and there is nothing remotely efficient about it. But hey — if it works, then go with it…right? The problem is that this approach can only take us so far. Models can only grow so large before the cost of training them and running inference becomes completely impractical. And once you have trained a model on the whole internet, well, what more is there to train it on? Sure, we can generate an infinite amount of synthetic data, but this approach tends to lead to hallucinations, and there are serious questions as to whether much of anything can actually be learned from fake data.
Clearly we need to start exploring more efficient options to continue making forward progress. Moreover, to ensure privacy and enable real-time applications, we need smaller models that can run on edge computing hardware. One promising direction for addressing these challenges is the Mixture-of-Experts (MoE) framework, a method that dynamically activates only a subset of specialized submodels — or "experts" — for any given task. Unlike traditional models that rely on activating all parameters for every operation, MoE employs sparse activation, allowing AI systems to handle resource-intensive tasks more efficiently.
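To make sparse activation concrete, here is a minimal sketch of a top-k MoE layer written in PyTorch. The expert count, layer sizes, and top-k value are arbitrary choices for illustration rather than parameters of any particular model, and the routing loop is written for readability rather than speed.

```python
# Minimal sparse MoE layer with top-k routing (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # lightweight gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (num_tokens, d_model)
        scores = self.gate(x)                        # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)     # mix weights over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):               # only the selected experts ever run
            for e in topk_idx[:, slot].unique().tolist():
                mask = topk_idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

# 16 tokens, each handled by just 2 of the 8 experts.
layer = SparseMoELayer()
print(layer(torch.randn(16, 512)).shape)             # torch.Size([16, 512])
```

The key point is in the forward pass: the gating network scores every expert, but only the two highest-scoring experts actually run for each token, so most of the layer's parameters sit idle on any given input.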
However, existing MoE frameworks are not without their limitations. Traditional methods often assume a uniform environment where all experts are equally capable and accessible. This assumption falls short in real-world edge computing scenarios, where edge devices vary widely in terms of computational power, energy efficiency, and latency. Moreover, selecting the optimal subset of experts becomes a daunting combinatorial problem, particularly when accounting for these attributes. Standard optimization techniques struggle to meet the competing demands of performance, latency, and energy consumption.
A team at Zhejiang University has recently introduced what they call Mixture-of-Edge-Experts (MoE²) to address the challenges of deploying large language models (LLMs) in edge environments. Unlike conventional MoE approaches, MoE² introduces a two-level expert selection mechanism tailored to the heterogeneous nature of edge devices. At the coarse-grained level, a subset of experts is chosen using optimization-based methods so that energy consumption and latency constraints are satisfied. At the fine-grained level, each incoming prompt is dynamically routed to the most suitable experts within that subset by a specialized gating network, ensuring efficient task handling.
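The paper's full formulation is more involved, but a rough sketch of how such a two-level scheme might fit together is shown below. The additive cost model, the greedy coarse-grained selection, and the stand-in gating score are illustrative assumptions, not the authors' actual algorithm.

```python
# Two-level selection sketch: coarse subset selection under budgets, then
# per-prompt routing inside the chosen subset. All numbers and the greedy
# heuristic are made up for illustration.
from dataclasses import dataclass

@dataclass
class EdgeExpert:
    name: str
    latency_ms: float    # expected per-request latency on the hosting device
    energy_mj: float     # expected per-request energy draw
    quality: float       # offline estimate of response quality

def coarse_select(experts, latency_budget_ms, energy_budget_mj):
    """Coarse level: greedily keep high-quality experts while the combined
    latency and energy of the selection stays within budget."""
    chosen, lat, eng = [], 0.0, 0.0
    for e in sorted(experts, key=lambda e: e.quality, reverse=True):
        if lat + e.latency_ms <= latency_budget_ms and eng + e.energy_mj <= energy_budget_mj:
            chosen.append(e)
            lat += e.latency_ms
            eng += e.energy_mj
    return chosen

def fine_route(prompt, subset, gate_score):
    """Fine level: send the prompt to whichever selected expert the gating
    network scores highest for this particular input."""
    return max(subset, key=lambda e: gate_score(prompt, e))

# Usage with a stand-in gating score; a real gating network would be learned.
experts = [EdgeExpert("jetson-llama", 120, 900, 0.71),
           EdgeExpert("server-qwen", 60, 2500, 0.85),
           EdgeExpert("jetson-phi", 90, 700, 0.65)]
subset = coarse_select(experts, latency_budget_ms=200, energy_budget_mj=3500)
best = fine_route("Translate this sentence into French.", subset, lambda p, e: e.quality)
print([e.name for e in subset], "->", best.name)
```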
MoE² also addresses the inherent complexity of expert selection by leveraging key insights into the problem's structure. For instance, gating parameters that are optimal for the entire set of LLM experts remain optimal for any subset of them, which means the gating network only needs to be trained once rather than retrained for every candidate selection. Additionally, the framework uses a discrete monotonic optimization algorithm to select experts in a way that improves performance while still respecting the system's constraints.
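As a loose illustration of why monotonicity matters, the brute-force sketch below compares only "maximal" feasible subsets: if performance never drops when another expert is added, then any feasible selection that can still be grown within the latency and energy budgets cannot be the best one. The additive cost model and the exhaustive enumeration are simplifying assumptions made here for clarity; the actual discrete monotonic optimization algorithm is designed to avoid this kind of brute-force search.

```python
# Brute-force illustration of the monotonicity argument (not the paper's
# algorithm): with a monotone performance function, only maximal feasible
# expert subsets need to be compared.
from itertools import combinations

def feasible(subset, costs, lat_budget, eng_budget):
    lat = sum(costs[e][0] for e in subset)
    eng = sum(costs[e][1] for e in subset)
    return lat <= lat_budget and eng <= eng_budget

def maximal_feasible_subsets(experts, costs, lat_budget, eng_budget):
    """Yield feasible subsets that cannot be grown without breaking a budget."""
    for r in range(len(experts), 0, -1):
        for subset in combinations(experts, r):
            if not feasible(subset, costs, lat_budget, eng_budget):
                continue
            remaining = [e for e in experts if e not in subset]
            if all(not feasible(subset + (e,), costs, lat_budget, eng_budget)
                   for e in remaining):
                yield subset

def best_subset(experts, costs, perf, lat_budget, eng_budget):
    # Since perf() never decreases when an expert is added, every non-maximal
    # feasible subset is dominated by some maximal one and can be skipped.
    return max(maximal_feasible_subsets(experts, costs, lat_budget, eng_budget),
               key=perf, default=())

# Toy example: per-expert (latency_ms, energy_mJ) costs and a monotone perf().
costs = {"llama": (120, 900), "qwen": (60, 2500), "phi": (90, 700)}
def perf(subset):
    return len(subset)   # stand-in: adding an expert never hurts
print(best_subset(list(costs), costs, perf, lat_budget=200, eng_budget=3500))
```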
The framework has been successfully implemented on NVIDIA Jetson AGX Orin development kits and edge servers equipped with NVIDIA RTX 4090 GPUs. Experiments validate its ability to achieve optimal trade-offs between latency and energy consumption while outperforming baseline models. By dynamically adapting to resource constraints, MoE² enables real-time applications like conversational AI, translation, and intelligent assistance to thrive in environments where traditional LLMs falter.