Don't Make Me Repeat Myself

A new technique improves keyword spotting accuracy with a fine-tuning process that can run on even highly resource-constrained devices.

Nick Bild
4 months ago · Machine Learning & AI

Keyword spotting technologies are used to identify specific words or phrases within a stream of audio. This capability has found applications in many fields, including voice-controlled devices, virtual assistants, security systems, and speech-to-text transcription services. By recognizing keywords or phrases, these systems can trigger specific actions, responses, or alerts, providing convenience and efficiency to users.

However, the accuracy of keyword spotting systems can be substantially reduced by environmental factors such as background noise or variations in the speaker's voice. For instance, if the system has been trained on a limited dataset that does not encompass diverse backgrounds, accents, or speech patterns, it may struggle to accurately recognize keywords. Additionally, speech disorders or unusual manners of speaking can further challenge the system's accuracy.

Traditionally, addressing these challenges would involve designing larger models and training them on more extensive datasets to improve generalization. However, larger models may not be suitable for the resource-constrained devices commonly used for running keyword spotting algorithms. These devices may lack the computational power or memory capacity to accommodate such models.

One potential solution to this issue is on-device training, which involves fine-tuning the model for a particular use case directly on the device. Conventional on-device training methods, however, can be resource-intensive, making them impractical for many devices. A trio of engineers at ETH Zurich and Huawei Technologies has developed a new technique that enables fine-tuning of keyword spotting models on-device, even when the device is highly resource-constrained. With this method, even an ultra-low-power microcontroller with about 4 KB of memory is sufficient for model fine-tuning.

Current on-device training schemes rely on memory- and processing-intensive updates to the backbone of the model. In this work, the team instead froze the lightweight, pre-trained backbone of their model, so those weights never need to be altered during training. The model instead makes use of user embeddings: representations of speech data in a lower-dimensional space that capture important features, in this case the unique characteristics of an individual user's speech patterns. Updating only these embeddings tailors the system's recognition to the individual user, improving accuracy, and it is far less computationally intensive than retraining the backbone.
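The team's exact architecture isn't detailed here, but the general idea can be sketched in a few lines of PyTorch. In the hypothetical `KeywordSpotter` below, every backbone and classifier weight is frozen, and the only trainable parameter is a small per-user embedding vector added to the backbone's features; the class name, dimensions, and layout are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    """Sketch: frozen pre-trained backbone plus a tiny trainable
    user embedding. Names and sizes are illustrative assumptions."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 64,
                 num_classes: int = 35):
        super().__init__()
        self.backbone = backbone               # pre-trained feature extractor
        self.classifier = nn.Linear(embed_dim, num_classes)

        # Freeze everything registered so far: these weights are never
        # touched during on-device fine-tuning.
        for p in self.parameters():
            p.requires_grad = False

        # The only trainable parameters: one small vector that shifts the
        # backbone's features toward this particular speaker's voice.
        self.user_embedding = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(audio_features)  # assumed shape: (batch, embed_dim)
        feats = feats + self.user_embedding    # condition on the user
        return self.classifier(feats)
```

With this layout, only a few dozen numbers ever change on the device, while the heavy feature extractor runs purely in inference mode.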

An experiment involving six speakers was conducted to determine how well the model could adapt to a new user. In each case, the process started from the original pre-trained keyword spotting model, which was then retrained with between 4 and 22 additional voice samples per class, across 8 to 35 keyword classes. In every case, accuracy improved with updates to the user embeddings alone; in the best case, the error rate dropped by 19 percent.
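To give a sense of what that adaptation might look like in practice, the hypothetical loop below retrains only the user embedding on a small batch of labeled utterances from the new speaker, using the `KeywordSpotter` from the previous sketch. The optimizer choice, learning rate, and epoch count are assumptions for illustration, not values reported by the team.

```python
import torch

def adapt_to_user(model, samples, labels, epochs=5, lr=1e-2):
    """Fine-tune only the user embedding; all other weights stay frozen.
    `model` is the KeywordSpotter from the previous sketch."""
    # Collect the parameters left trainable -- here, just the embedding.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(samples)          # e.g. 4-22 utterances per keyword class
        loss = loss_fn(logits, labels)
        loss.backward()                  # gradients reach only the embedding
        optimizer.step()
    return model
```

Because the backbone never receives gradient updates, no optimizer state or weight copies for it ever need to live in RAM, which is where most of the memory savings come from.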

Requiring only about one million floating-point operations of compute and less than 4 KB of memory per retraining epoch, the system has been shown to be feasible even on highly resource-constrained hardware. And given the accuracy gains that were observed, it could find useful applications in a number of keyword spotting devices. In the future, we may less frequently be frustrated by devices that just cannot seem to understand us, no matter how many times we repeat ourselves.
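For a sense of scale, here is a back-of-envelope look at why an embedding-only update can fit in such a tiny budget. The dimensions are assumptions for illustration, not figures from the paper.

```python
# Illustrative memory budget for embedding-only fine-tuning.
# All sizes below are assumptions, not figures from the paper.
EMBED_DIM = 64
BYTES_PER_FLOAT = 4

weights = EMBED_DIM * BYTES_PER_FLOAT    # 256 B: the embedding itself
gradients = EMBED_DIM * BYTES_PER_FLOAT  # 256 B: one gradient per weight
scratch = 2048                           # assumed activation buffer for the
                                         # frozen forward pass

print(weights + gradients + scratch, "bytes")  # ~2.5 KB, under a 4 KB budget
```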

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.