Voice assistants are becoming embedded in our daily routines thanks to the convenience of their frictionless interfaces. But speaking to these assistants aloud is not a good option for every use case. Silent speech interfaces, on the other hand, allow for similar communication with electronic devices, but without the need for audible sound. These interfaces use sensors to detect the movements of the mouth, tongue, and throat, then convert those movements into words or actions on a device, all without making a sound.
The possibilities of this technology are numerous. People with disabilities that limit their ability to speak, such as those with ALS or cerebral palsy, could use silent speech interfaces to communicate more easily with others. In addition, silent speech interfaces could be useful in environments where speaking aloud is not practical, such as a library or hospital.
The technology is also being explored as a potential tool for law enforcement and military personnel. In situations where silence is necessary, silent speech interfaces could allow individuals to communicate with one another without making any sound at all.
Despite the potential of devices powered by such interfaces, they are rarely found in commercial applications. A big reason is that existing solutions tend to have a small, inflexible vocabulary of words that they can recognize, which significantly limits their usefulness and the range of applications they can support.
A fresh idea from a team at The University of Tokyo has led to the development of a much more versatile silent speech recognition system that they call LipLearner. Rather than being locked into a small, fixed vocabulary, LipLearner can add user-defined commands with minimal effort. This is made possible through a novel machine learning approach that enables few-shot learning of new words or phrases.
The researchers implemented their techniques on a smartphone. The smartphone's camera captures images of the user so that the movements of their lips can be analyzed. Contrastive learning was then used in a semi-supervised manner to train the initial model on public datasets, giving it a solid foundation of knowledge about how silent lip movements translate into spoken words.
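The article does not include the team's training code, but contrastive pretraining of this kind is often built around a loss such as NT-Xent, which pulls the embeddings of two views of the same clip together while pushing apart all other clips in the batch. The sketch below is a minimal NumPy illustration of that general idea; the function name, batch shapes, and temperature value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss (a common choice; assumed here, not
    confirmed by the paper).

    z1, z2: (N, D) arrays of embeddings for two views of the same N
    lip-movement clips. (z1[i], z2[i]) are positive pairs; every other
    clip in the batch serves as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / temperature                       # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                    # mask self-similarity
    n = len(z1)
    # index of each embedding's positive partner in the stacked batch
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

Intuitively, the loss is low when each clip's two views are more similar to each other than to any other clip in the batch, which is what lets the encoder learn lip-movement structure without word labels.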
This is followed by a simple linear classifier that can be trained in just a few seconds, with a very small number of samples, to recognize words that were not part of the model's initial training. This lets users teach the system any custom commands, or even non-verbal lip gestures, that they would like it to understand. Previous systems needed expensive computers with powerful GPUs to accomplish this, not to mention a lot of technical know-how and time.
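To give a sense of why this personalization step is so cheap, here is a minimal sketch of a linear (softmax) head fit on frozen encoder embeddings. The class name, dimensions, and hyperparameters are hypothetical stand-ins, not the team's implementation; the point is simply that fitting a linear classifier on a handful of embedding vectors takes seconds, with no GPU required.

```python
import numpy as np

class FewShotLinearHead:
    """Hypothetical linear (softmax) classifier trained on frozen
    encoder embeddings; sizes and hyperparameters are assumptions."""

    def __init__(self, dim, n_classes, lr=0.1, steps=500):
        self.W = np.zeros((dim, n_classes))
        self.b = np.zeros(n_classes)
        self.lr, self.steps = lr, steps

    def fit(self, X, y):
        Y = np.eye(self.b.size)[y]                # one-hot labels
        for _ in range(self.steps):
            p = self._softmax(X @ self.W + self.b)
            self.W -= self.lr * X.T @ (p - Y) / len(X)  # cross-entropy grad
            self.b -= self.lr * (p - Y).mean(axis=0)
        return self

    def predict(self, X):
        return np.argmax(X @ self.W + self.b, axis=1)

    @staticmethod
    def _softmax(s):
        e = np.exp(s - s.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
```

Because only this small head is trained while the contrastively pretrained encoder stays fixed, a few examples per command are enough to add new vocabulary.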
To confirm that LipLearner would perform well under real-world conditions, the team collected a diverse dataset from different locations, under different lighting conditions, and while holding the smartphone in different ways. These tests revealed a very respectable F1 score of 0.89 in classifying 25 different commands. A user study, consisting of 16 participants, was also conducted to gauge how users would feel about the training process required to teach the system new commands. In general, users reported that it was simple to use, and some even commented that it was fun to work with. The accuracy of the silent speech recognition climbed as high as 96% on average when the model was given just three training samples.
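For readers unfamiliar with the metric, an F1 score for a multi-command classifier like this is commonly reported as a macro average of per-class F1 scores, though the article does not specify which averaging the team used. A small sketch of that computation:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: compute F1 per class, then average equally.
    One common convention for multi-command results; the paper's exact
    averaging is an assumption here."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))
```

An F1 of 0.89 across 25 classes means the system balances precision (rarely firing the wrong command) and recall (rarely missing a command) well across the whole vocabulary.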
At present, the researchers are working to tweak their smartphone app in response to feedback from the user study to make it a bit more intuitive. Perhaps the methods used by this team will enable all sorts of devices to be smarter and more customized in the near future.