A person’s face conveys a lot of information about them — it can reveal what they are feeling, or what activities they are engaged in. Attempting to understand a person's behavior without any information about their facial expressions or movements is going to paint an incomplete picture. Yet this is exactly what many of the smart devices that are taking an ever more prominent place in our daily lives try to do. Until facial cues can be detected and interpreted by these gadgets, we will be left with awkward and sometimes inappropriate human-computer interactions.
Unfortunately, this is not a particularly easy problem to solve. Most current attempts to understand human facial expressions rely on cameras, which are impractical for many use cases (especially mobile devices) and also come with privacy concerns, hefty power budgets, and a need for significant compute resources to process the data. Researchers at Cornell University have moved the field significantly forward with their recent work on a device called EarIO, which can reconstruct facial expressions using only sound waves.
EarIO takes the form of a pair of earbuds, with each containing both a speaker and a microphone. The speaker emits an inaudible sound directed at the side of the face, and the returning reflections are recorded by the microphone. Encoded within these sound reflections are data points describing muscular movements of the face. The only question is how one would translate these audio signals into information about facial expressions.
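The article doesn't detail EarIO's exact signal processing, but the core idea behind this style of active acoustic sensing can be sketched: cross-correlate the emitted signal with the microphone recording to build an "echo profile," in which peaks correspond to reflections arriving after different delays. Everything below (chirp parameters, sample rate, function names) is an illustrative assumption, not the paper's actual pipeline:

```python
import numpy as np

def echo_profile(transmitted, received):
    """Cross-correlate the emitted chirp with the microphone recording.
    Peaks at non-zero lags correspond to reflections from surfaces at
    different distances (distance ~ lag * speed_of_sound / 2 / sample_rate)."""
    corr = np.correlate(received, transmitted, mode="full")
    # Keep only non-negative lags: echoes arrive after emission.
    return np.abs(corr[len(transmitted) - 1:])

# Toy demonstration: an inaudible 16-20 kHz chirp plus a delayed,
# attenuated copy standing in for a facial reflection.
fs = 48_000                                          # sample rate (Hz)
t = np.arange(0, 0.01, 1 / fs)                       # 10 ms transmit window
chirp = np.sin(2 * np.pi * (16_000 + 2e5 * t) * t)   # sweep 16 -> 20 kHz
delay = 100                                          # round-trip delay, samples
rx = np.zeros(len(chirp) + delay)
rx[:len(chirp)] += chirp                             # direct speaker-to-mic path
rx[delay:] += 0.3 * chirp                            # weaker reflection off the face
profile = echo_profile(chirp, rx)
# The strongest peak away from the direct path sits near the true delay.
peak = int(np.argmax(profile[30:])) + 30
```

A change in the face's geometry shifts or reshapes these correlation peaks, and it is this pattern of changes that a downstream model can learn from.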
Trying to understand this relationship and encode it in software by hand would be highly labor-intensive and error-prone, so the team turned to machine learning. They designed and trained a deep convolutional neural network that translates the audio reflection data into 52 parameters representing different aspects of the human face. These parameters are designed to mirror those produced by the TrueDepth depth camera on an iPhone. In the end, the algorithm was able to identify complex relationships between muscle movements and facial expressions that cannot be spotted by the human eye.
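As a concrete illustration of the model's input/output contract (not the paper's actual architecture, which the article doesn't specify), here is a minimal NumPy sketch of a 1-D convolutional network mapping one echo-profile frame to 52 blendshape-style coefficients; all layer shapes and weights are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid-mode 1-D convolution: x is (C_in, L), w is (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    n = x.shape[1] - k + 1
    out = np.empty((c_out, n))
    for i in range(n):
        out[:, i] = np.tensordot(w, x[:, i:i + k], axes=([1, 2], [0, 1])) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative, untrained weights; a real system would learn these from
# the per-user calibration data.
w1, b1 = 0.1 * rng.standard_normal((8, 1, 9)), np.zeros(8)
w2, b2 = 0.1 * rng.standard_normal((16, 8, 9)), np.zeros(16)
w3, b3 = 0.1 * rng.standard_normal((52, 16)), np.zeros(52)

def predict_blendshapes(frame):
    """Map one echo-profile frame (length-128 vector, an assumption) to 52
    coefficients analogous to the TrueDepth camera's facial parameters."""
    h = relu(conv1d(frame[None, :], w1, b1))
    h = relu(conv1d(h, w2, b2))
    h = h.mean(axis=1)                      # global average pooling
    logits = w3 @ h + b3
    return 1.0 / (1.0 + np.exp(-logits))    # squash coefficients into [0, 1]

params = predict_blendshapes(rng.standard_normal(128))
print(params.shape)  # (52,)
```

The sigmoid output mirrors the convention that blendshape coefficients express each facial feature's activation on a 0-to-1 scale.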
To confirm that EarIO was working as designed, the team carried out a user study with sixteen participants, who wore the EarIO earbuds and performed nine common facial expressions. A TrueDepth camera on an iPhone 12 was also pointed at the participants to record ground-truth facial expression data. The results were largely positive, showing that the system can accurately recognize facial expressions. It was also discovered that certain background noises (wind, road noise, background discussions) do not interfere with operation of the device.
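The article doesn't state how predictions were scored against the TrueDepth recordings, but a natural comparison is the mean absolute error over the 52 coefficients, a common choice for blendshape regression; this helper is an assumption for illustration:

```python
import numpy as np

def blendshape_mae(pred, truth):
    """Mean absolute error between predicted and ground-truth blendshape
    coefficients; both arrays have shape (n_frames, 52), values in [0, 1]."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.abs(pred - truth).mean())

# Toy check: predictions uniformly off by 0.25 score an MAE of 0.25.
truth = np.zeros((2, 52))
pred = np.full((2, 52), 0.25)
print(blendshape_mae(pred, truth))  # 0.25
```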
One of the features that enables EarIO to perform so well is the high sensitivity of its acoustic sensing method. Unfortunately, this is something of a double-edged sword: that same sensitivity causes the device to pick up subtle, irrelevant changes in the environment, like small movements of the head, and these distractions can lead it to produce incorrect results. The team will need to work this out before we see any widespread deployments of the technology. Another limitation of the present device is that each user must capture 32 minutes of facial data to train the machine learning model before first use, which is a pretty hefty ask for a commercial device.
Despite these shortcomings, EarIO shows a good deal of promise at this early stage. It performs well while preserving privacy and sipping only a little energy. And because the audio data involved is relatively lightweight, only a moderate amount of computing power is needed: all processing can take place on a nearby smartphone connected to the earbuds via Bluetooth. With any luck, the researchers will sort out EarIO's remaining issues and we will see this technology integrated into many smart devices in the near future.