AI Hear What You’re Saying

Edge Impulse's new suite of GenAI tools makes it easy to build a multilingual keyword spotting system that works around the world.

Nick Bild
Multilingual keyword spotting is possible on an edge device (📷: Edge Impulse)

The most natural way for people to communicate is through speech. But when it comes to working with computers and other electronic gadgets, the best option is usually a keyboard, touchscreen, or a set of buttons. We are still a long way from the computers of Star Trek that can understand and respond to any natural language instruction we give them, but considering recent advances in artificial intelligence, we are closer than ever before.

In reality, however, most applications do not need a system that understands every conceivable request in order to offer more natural, voice-based interactions. Simply recognizing a handful of keywords is sufficient to control a television, give instructions to a robot, or operate a home automation system. That is where keyword spotting comes in: machine learning algorithms trained to reliably recognize a relatively small number of words. Because the scope of the problem is limited, these algorithms can run on inexpensive, low-power computing platforms, which makes them suitable for use in consumer electronics of all sorts.

But as Dmitry Maslov of Edge Impulse recently pointed out, producing a keyword spotting application that will work around the world can be a big headache. The device would need to be capable of reliably recognizing when each of the keywords was spoken in perhaps dozens of languages. This presents many problems, especially related to data collection and testing.

Maslov showed, however, that these problems can be virtually eliminated using tools recently unveiled by Edge Impulse. With the new synthetic data generator, it is now possible to rapidly produce a large, diverse dataset of synthetic voice samples in virtually any language. One only needs to specify the keyword and the number of samples to create, and OpenAI’s Whisper text-to-speech model will generate them in a matter of seconds. The process can be repeated for each keyword and language.
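To make the idea concrete outside of the Studio interface, here is a minimal sketch of multilingual keyword generation using OpenAI’s Python SDK and its tts-1 text-to-speech model. The keyword phrases, voices, file names, and sample counts are all illustrative assumptions; Edge Impulse's synthetic data block handles the equivalent step from within the Studio.

```python
# Minimal sketch of multilingual keyword sample generation, assuming
# OpenAI's Python SDK and its "tts-1" text-to-speech model. Edge
# Impulse's synthetic data block performs the equivalent step inside
# the Studio, so treat this as an illustration, not their implementation.
import random
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical keyword: the same command rendered in several languages.
KEYWORD = {
    "en": "lights on",
    "es": "enciende la luz",
    "de": "Licht an",
}
VOICES = ["alloy", "echo", "nova", "shimmer"]  # vary voices for diversity
SAMPLES_PER_LANGUAGE = 10

out_dir = Path("dataset/lights_on")
out_dir.mkdir(parents=True, exist_ok=True)

for lang, phrase in KEYWORD.items():
    for i in range(SAMPLES_PER_LANGUAGE):
        response = client.audio.speech.create(
            model="tts-1",
            voice=random.choice(VOICES),
            input=phrase,
        )
        response.write_to_file(out_dir / f"{lang}.lights_on.{i}.mp3")
```

Looping over additional keywords and languages builds out a fully labeled, multilingual dataset in minutes.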

The same technique can also be leveraged to create the background classes needed to train a robust model by generating random words that are not in the set of keywords. Furthermore, Edge Impulse is integrated with ElevenLabs, which makes it possible to produce other types of audio, like typical background noises from an office or city street, to include in the background class.
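As a rough sketch of that background class, one could synthesize random non-keyword words with the same TTS call. The distractor word list below is invented for illustration, while non-speech sounds like office hum or street noise would come from the ElevenLabs integration instead.

```python
# Sketch of filling an "unknown" background class with random words
# that are not keywords; the distractor list here is invented for
# illustration. Non-speech background noise (office hum, street
# sounds) would come from the ElevenLabs integration instead.
import random
from pathlib import Path

from openai import OpenAI

client = OpenAI()
DISTRACTORS = ["window", "coffee", "seven", "yesterday", "purple", "table"]

out_dir = Path("dataset/unknown")
out_dir.mkdir(parents=True, exist_ok=True)

for i in range(30):
    word = random.choice(DISTRACTORS)
    response = client.audio.speech.create(model="tts-1", voice="onyx", input=word)
    response.write_to_file(out_dir / f"unknown.{word}.{i}.mp3")
```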

Once the synthetic dataset has been created, it can be used to train a keyword spotting pipeline with Edge Impulse's standard suite of tools. In this case, Maslov added preprocessing blocks to the pipeline that slice incoming audio into segments and extract the most informative features from each one. Those features were then passed into a pre-trained MobileNetV1 neural network with a width multiplier of 0.1. By using transfer learning, it was possible to get good results even with a small training dataset.
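As a rough outline of what those stages do, the sketch below slices audio into one-second windows, computes MFCC features, and attaches a small classification head to a MobileNetV1 backbone. It assumes 16 kHz mono audio and uses librosa and TensorFlow; the alpha=0.25 backbone stands in for Edge Impulse's 0.1-width, keyword-pretrained checkpoint, which Keras does not ship.

```python
# Rough outline of the preprocessing and transfer learning steps,
# assuming 16 kHz mono audio, librosa for feature extraction, and
# TensorFlow for the model. Edge Impulse's blocks implement these
# stages internally; this is a sketch of the equivalent logic.
import librosa
import numpy as np
import tensorflow as tf

SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE  # slice incoming audio into one-second segments


def extract_features(path: str) -> np.ndarray:
    """Slice a clip into windows and compute one MFCC 'image' per window."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    windows = [
        audio[i : i + WINDOW] for i in range(0, len(audio) - WINDOW + 1, WINDOW)
    ]
    mfccs = [librosa.feature.mfcc(y=w, sr=SAMPLE_RATE, n_mfcc=32) for w in windows]
    feats = np.stack(mfccs)                                # (n_windows, 32, 32)
    return np.repeat(feats[..., np.newaxis], 3, axis=-1)   # grayscale -> 3 channels


# Stand-in backbone: Keras only ships MobileNetV1 down to alpha=0.25, and
# its ImageNet weights do not cover this input size, hence weights=None.
# Edge Impulse supplies its own keyword-pretrained 0.1-width checkpoint.
backbone = tf.keras.applications.MobileNet(
    input_shape=(32, 32, 3), alpha=0.25, include_top=False, weights=None
)
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),  # 3 keywords + "unknown"
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```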

In this case, only about four minutes of training audio was generated, yet classification accuracy on the training set topped 95 percent. The model testing tool, which evaluates the model on data held out from training, backed up this result with a reported accuracy of nearly 90 percent.

As a final step, the pipeline was deployed to an Arduino Nano RP2040 Connect development board. Running on a resource-constrained platform such as this demonstrates that the process is viable for developing low-cost consumer electronics. It looks like generative AI is not just for making funny cat memes after all.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.