This Will Leave You Speechless

A pair of neural networks translate silent speech into audio that can be used to control unmodified smart speakers and other devices.

Nick Bild
18 days ago | Machine Learning & AI
Interacting with a smart speaker via silent speech (📷: N. Kimura et al.)

From smartphones to smart speakers, digital devices have become an integral part of our daily lives. We rely on them to carry out all manner of everyday activities efficiently, both at home and at work. One of the most significant advances in recent years is the ability to operate these devices by voice. Voice interfaces are rapidly gaining popularity due to their convenience and ease of use.

According to a report by Allied Market Research, the global speech recognition market was valued at $6.39 billion in 2019 and is projected to reach $29.28 billion by 2027, growing at a CAGR of 21.6% from 2020 to 2027. This growth can be attributed to the increasing popularity of smart speakers, voice assistants, and the growing use of speech recognition in healthcare and banking.

However, the adoption of this technology has not been without its challenges. In public places, voice-controlled systems can be an annoyance to those nearby, and background noise can significantly reduce the accuracy of speech recognition. And when it comes to sensitive information, blabbing it out loud for anyone to overhear is clearly not the most secure approach. Another problem is that those with speech impairments often find it difficult or impossible to use voice-controlled devices.

A trio of researchers at The University of Tokyo have proposed a solution to these problems: a system that can recognize silent speech (i.e., going through the motions of speech without making any sound). Their device may not win any points for style, but it has shown itself to be reasonably accurate in translating movements of the mouth into speech. They achieved this by using an ultrasonic imaging sensor to capture the movements, which are then translated into speech with the help of a deep learning pipeline.

To prove the concept, a relatively large, handheld ultrasonic imaging sensor was held against the underside of the jaw. The ultrasonic waves penetrate into the oral cavity, after which they are reflected back to the sensor. A sequence of ultrasound images is captured in this way to assess what movements occur during periods of speech.
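The article does not describe the sensor's internals, but the pulse-echo idea behind ultrasound imaging in general can be sketched in a few lines of Python: the depth of a reflecting surface follows from how long an echo takes to return. The function name and the 1540 m/s figure (a conventional average for soft tissue, not a value from the researchers) are assumptions made for illustration.

```python
# Conventional average speed of sound in soft tissue; an assumption for this
# sketch, not a figure reported by the researchers.
SPEED_OF_SOUND_TISSUE_M_S = 1540.0


def echo_depth_m(round_trip_s: float,
                 speed_m_s: float = SPEED_OF_SOUND_TISSUE_M_S) -> float:
    """Estimate the depth (in meters) of a reflector from an echo's round-trip time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the reflector and back, so halve the total path.
    return speed_m_s * round_trip_s / 2.0


if __name__ == "__main__":
    # A hypothetical echo arriving 65 microseconds after the pulse was sent.
    print(f"{echo_depth_m(65e-6) * 100:.2f} cm")
```

An imaging sensor repeats this kind of measurement along many scan lines to build each frame, which is why a sequence of frames can reveal how structures inside the mouth move over time.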

The sequence of images is fed into a convolutional neural network, which translates it into a Mel-scale spectrum representation of the sound. This result is then processed by a second neural network that models the local and contextual information of the input sequence to improve the quality of the sound representation. Finally, the refined spectrum is converted into a digital audio file for playback using the Griffin-Lim algorithm, which estimates the phase information that a magnitude-only spectrum lacks.
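To give a feel for the Mel scale mentioned above, here is a minimal Python sketch of one widely used Hz-to-mel conversion (the HTK-style formula). The article does not say which variant or parameters the researchers used, so the formula choice, the function names, the 8 kHz upper limit, and the band count are all assumptions made for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float = 0.0,
                     f_max: float = 8000.0) -> list:
    """Return n_bands center frequencies (Hz), evenly spaced on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_bands + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Hypothetical example: 8 bands between 0 Hz and 8 kHz.
    for center in mel_band_centers(8):
        print(f"{center:8.1f} Hz")
```

Printing the centers shows the bands bunching together at low frequencies and spreading out at higher ones, which roughly mirrors how human hearing resolves pitch. A Mel-scale spectrum summarizes the energy in bands like these for each short slice of time.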

A group of participants was recruited to test the device and assess how well it works in real-world scenarios. Each participant was asked to silently speak a number of voice commands to an unmodified Amazon Echo smart speaker. The audio sequences produced by the neural networks were played through a computer’s speaker, and it was found that the smart speaker responded as expected to these commands.

While this system has clear utility, the present form factor makes it impractical for many real-world applications. The researchers are, however, at work on a new version of the device. They hope to eventually be able to fit the ultrasonic imaging sensor into a relatively small strap that fits under the jaw. They plan to pair this with earphones so that the wearer can also receive private responses to their silent speech commands.
