An international team of scientists has developed a wearable platform for silent speech recognition, capable of accurately detecting non-vocalized commands in English and Mandarin — by strapping a small infrared camera to your neck and filming your chin: SpeeChin.
"SpeeChin [is] a smart necklace that can recognize 54 English and 44 Chinese silent speech commands. A customized infrared (IR) imaging system is mounted on a necklace to capture images of the neck and face from under the chin," the team behind the device explains. "These images are first pre-processed and then deep learned by an end-to-end deep convolutional-recurrent-neural-network (CRNN) model to infer different silent speech commands."
The idea behind SpeeChin is to address one of the biggest issues with voice recognition technologies: their unsuitability for use in public, whether due to privacy concerns or out of respect for those in earshot. Silent speech recognition, which allows users to form words without actually vocalizing them, is a potential answer — but it usually requires obtrusive cameras to film the user's face, or a range of uncomfortable sensors fitted around the cheeks, lips, and chin, or even attached to the tongue.
SpeeChin requires none of that: It's a simple necklace, built around an infrared camera and a filter housed in a 3D printed case, worn around the neck on a silver chain and connected to a Raspberry Pi for processing. The camera points directly upwards, capturing a terribly unflattering image of the wearer's chin, lips, and nose — an image that can be monitored for silent speech.
"There are two questions: First, why a necklace? And second, why silent speech?" asks corresponding author Chen Zhang, assistant professor at Cornell. "We feel a necklace is a form factor that people are used to, as opposed to ear-mounted devices, which may not be as comfortable. As far as silent speech, people may think, 'I already have a speech recognition device on my phone.'"
"But you need to vocalize sound for those, and that may not always be socially appropriate, or the person may not be able to vocalize speech. This device has the potential to learn a person’s speech patterns, even with silent speech."
The infrared camera — chosen over thermal and depth-sensing cameras for its low cost, compact size, and high resolution, as well as its improved ability to segment the wearer from the background — feeds into a data-processing pipeline that corrects for angle and positioning, picks the start and end times of utterances based on the degree of mouth movement detected, and then passes the segmented utterances to SpeechNet, an end-to-end convolutional recurrent neural network (CRNN) model.
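The segmentation step — picking an utterance's start and end frames from the degree of mouth movement — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the function name, threshold values, and movement scores are all assumptions for the sake of the example.

```python
def segment_utterance(movement, threshold=0.2, min_len=3):
    """Return (start, end) frame indices of the first span where
    per-frame mouth-movement scores stay above `threshold` for at
    least `min_len` consecutive frames; None if no such span exists.

    `movement` would come from comparing successive chin/lip frames
    (e.g. per-pixel differences); here it is just a list of floats.
    """
    start = None
    for i, score in enumerate(movement):
        if score > threshold:
            if start is None:
                start = i  # movement began: candidate utterance start
        else:
            # movement stopped: accept the span if it was long enough
            if start is not None and i - start >= min_len:
                return (start, i)
            start = None
    # handle an utterance still in progress at the end of the clip
    if start is not None and len(movement) - start >= min_len:
        return (start, len(movement))
    return None

# Hypothetical per-frame movement scores for an eight-frame clip.
scores = [0.0, 0.05, 0.3, 0.4, 0.35, 0.5, 0.1, 0.0]
print(segment_utterance(scores))  # -> (2, 6)
```

In the real system, each segmented span of frames (rather than scores) would then be fed to the CRNN classifier to infer which command was silently spoken.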
The results are impressive: on a dataset of 54 English commands covering digits, punctuation, navigation, smartphone controls, and common voice assistant wake phrases, the system correctly recognized commands 90.5 percent of the time; this rose to 91.6 percent on a dataset of 44 Mandarin commands. The device even proved able to operate while the user was walking, though with high variability between users — accuracy ranged from as low as 34.4 percent to as high as 91.9 percent.
The team has suggested a range of possible directions for future work, including improving performance outdoors in direct sunlight, reducing interference from clothing and long hair blocking the camera's view, moving from the Raspberry Pi to a lower-power microcontroller platform, and extending the system to recognize individual phonemes — which would allow it to transcribe arbitrary sentences rather than short, fixed commands.
The paper on SpeeChin has been published in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, under closed-access terms.