Tiny-Align Finds Its Voice on Edge Devices
Tiny-Align streamlines ASR-LLM integration on edge devices, enabling the development of a new generation of responsive voice assistants.
The recent wave of large language models (LLMs) is so good at conversation that it did not take long for people to start asking for these text-based chatbots to be built into voice assistants. That should come as no surprise, given that talking is a much more natural way to communicate than pecking away at a keyboard. Yet for a variety of reasons, commercial device manufacturers were slow to fill this gap, and the voice assistant market still lags behind customer demand.
Naturally, hobbyists stepped in and made things happen much more quickly. Yet, by and large, those homebrew voice assistants were pretty rough around the edges. Typically, they would leverage some automatic speech recognition (ASR) tool to transcribe verbal prompts to text, then forward that text into an LLM. However, this disjointed approach often fails when the audio input lacks corresponding text, or when mismatched pre-trained knowledge between the ASR and LLM causes performance degradation.
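The cascaded approach described above can be sketched in a few lines. The function names below are illustrative stand-ins, not any real project's API; the point is the hard boundary where everything except the transcript is thrown away.

```python
def asr_transcribe(audio: bytes) -> str:
    # Stand-in for an ASR model; a real system would run a model such as
    # wav2vec2 here and return only plain text.
    return "turn on the kitchen lights"

def llm_respond(prompt: str) -> str:
    # Stand-in for an LLM call on the transcribed text.
    return f"Okay, handling request: '{prompt}'"

def cascaded_assistant(audio: bytes) -> str:
    # Tone, speaker characteristics, and non-speech audio are all discarded
    # at this text-only handoff -- the weakness described above.
    text = asr_transcribe(audio)
    return llm_respond(text)
```

Because the two models only communicate through a text string, a mismatch in their pre-trained knowledge, or audio content with no clean textual equivalent, degrades the whole pipeline.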
Technologies have rapidly improved since those early days, and we now have joint ASR-LLM models that integrate audio features directly into the LLM through a shared representation space. This integration allows the model to better understand and process personalized audio input, as it aligns speech features with language understanding in a unified manner.
However, existing ASR-LLM models are mainly developed using high-performance computing environments, making them too resource-intensive for deployment on edge devices, like voice assistants. Moreover, personalized audio-based assistance requires the model to adapt to the specific speech characteristics of individual users, necessitating efficient on-device training. This adaptation relies on end-to-end training, which aligns audio features from ASR with the language understanding capabilities of LLMs, a process called cross-modal alignment. Unfortunately, current methods for cross-modal alignment are computationally expensive, posing challenges for resource-limited edge devices.
To address this situation, a team led by researchers at the University of Notre Dame has introduced Tiny-Align, a novel resource-efficient framework for aligning ASR encoders with LLMs on edge devices.
At the heart of Tiny-Align is a projector design called BridgeFormer, which is based on a transformer encoder architecture that excludes positional encoding. This design provides a larger and more expressive embedding space than traditional multi-layer perceptrons or deep neural networks. BridgeFormer acts as a bridge between the ASR encoder and the LLM by transforming audio embeddings from the ASR encoder into a format that can be effectively processed by the LLM, ensuring tight integration.
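To make the idea concrete, here is a minimal PyTorch sketch of such a projector: a small transformer encoder, with no positional encoding applied, that maps ASR frame embeddings into the LLM's embedding dimension. The dimensions and layer counts are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

class BridgeFormerSketch(nn.Module):
    """Illustrative projector between an ASR encoder and an LLM.

    A transformer encoder (deliberately without positional encoding)
    maps audio frame embeddings into the LLM embedding space.
    """

    def __init__(self, asr_dim=768, llm_dim=2048, n_heads=8, n_layers=2):
        super().__init__()
        # Project ASR features up to the LLM's hidden size first.
        self.in_proj = nn.Linear(asr_dim, llm_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=n_heads, batch_first=True)
        # Note: no positional encoding is added before this encoder.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio_embeds):
        # audio_embeds: (batch, frames, asr_dim)
        return self.encoder(self.in_proj(audio_embeds))  # (batch, frames, llm_dim)
```

The output sequence can then be fed to the LLM as if it were a run of ordinary token embeddings, which is what allows end-to-end training of the alignment.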
Additionally, Tiny-Align introduces an instruction injection mechanism during inference, which further enhances the model's ability to generate high-quality outputs by embedding task-specific instructions directly into the processing pipeline. This mechanism boosts performance by improving the alignment between audio input and language generation.
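Conceptually, instruction injection amounts to placing task-specific instruction embeddings ahead of the projected audio embeddings before the combined sequence enters the LLM. The sketch below assumes both have already been mapped into the LLM embedding space; it is a guess at the mechanism's shape, not the paper's implementation.

```python
import torch

def inject_instruction(instr_embeds: torch.Tensor,
                       audio_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend instruction embeddings to projected audio embeddings.

    instr_embeds: (batch, instr_len, llm_dim)
    audio_embeds: (batch, frames, llm_dim)
    Returns a single sequence of shape (batch, instr_len + frames, llm_dim)
    that the LLM consumes in one pass.
    """
    return torch.cat([instr_embeds, audio_embeds], dim=1)
```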
The Tiny-Align system was evaluated using five diverse datasets from TalkBank and tested with various state-of-the-art LLMs (e.g., Llama-3.2-1B, Gemma-2-2B) and the wav2vec2 ASR model. Effectiveness was measured using ROUGE-1 and ROUGE-L scores, while efficiency was assessed in terms of convergence time and resource usage. Tiny-Align achieved significantly faster and more stable training than the baselines, converging within 400 epochs on ADReSS-IS2020 and 100 epochs on ENNI. It also demonstrated scalability across different dataset sizes and robust performance on resource-limited devices like the NVIDIA Jetson AGX Orin.
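For readers unfamiliar with the metric, ROUGE-1 scores a generated text by its unigram overlap with a reference. A bare-bones version (real evaluations would use an established ROUGE library) looks like this:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Minimal ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand, ref = candidate.split(), reference.split()
    # Clipped overlap: each reference word counts at most as often as it appears.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-L works similarly but scores the longest common subsequence instead of unigrams, rewarding outputs that preserve the reference's word order.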
Furthermore, the inclusion of the instruction injection mechanism improved LLM comprehension of audio embeddings, further enhancing performance. Compared to baselines like NExTGPT and X-VILA, Tiny-Align consistently achieved better results with lower resource requirements, proving its efficiency for ASR-LLM alignment on edge devices.