The Sound of Silence

EarSpy exploits the large speaker and sensitive accelerometer in smartphones to eavesdrop on phone calls with the aid of machine learning.

Nick Bild
Machine Learning & AI
Overview of EarSpy approach (📷: A. Mahdad et al.)

Eavesdropping on telephone conversations has a long and complex history, dating back to the early days of the telephone itself. In the early 20th century, it was not uncommon for someone to listen in on a conversation by attaching a “tap” to the telephone line. As technology has advanced, so too have the methods and tools used for eavesdropping. Today, sophisticated electronic surveillance equipment and software can intercept and monitor conversations on cell phones, often without the knowledge of the people on the call.

Rather than exploiting the network itself, attackers have increasingly turned their attention to a less secure target — individual phones. The most direct way to eavesdrop on a conversation is to exploit the phone’s built-in microphone. However, smartphone operating systems heavily restrict access to this hardware, making it a difficult target to compromise. This has led attackers to explore less obvious avenues to achieve their goals. Since raw data from motion sensors can often be acquired without explicit permissions from the operating system, methods to exploit this data have become a popular area of research.

Coinciding with this interest in zero-permission motion sensor data is a trend among smartphone manufacturers toward larger, more powerful ear speakers in place of the tiny ones included in past models. A team led by researchers at Texas A&M University recently published work exploring how these more powerful speakers influence the ability to reconstruct speech from motion sensor readings. Combining these hardware changes with machine learning, they showed with their system, called EarSpy, that eavesdropping on conversations may be possible without elevated permissions.

The upgraded ear speakers included on many newer smartphones have the effect of generating stronger vibrations in the body of the phone as they produce audio. Those stronger vibrations can be measured by the highly sensitive motion sensors (e.g., accelerometer, gyroscope) that are also virtually standard-issue on any modern handset. Since those vibrations are the direct result of the speech being reproduced by the speakers, it stands to reason that the information needed to reconstruct that speech is present in the motion data. All that remains to be done is to translate motion measurements back into speech.
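The paper does not come with a public reference implementation, but a common first step in pipelines like this is to convert the raw accelerometer trace into a time-frequency representation (a spectrogram) so that a classifier can pick out speech-related structure. Below is a minimal NumPy sketch; the sampling rate, window length, and hop size are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def accel_spectrogram(samples, frame_len=128, hop=64):
    """Short-time FFT magnitude of a 1-D accelerometer axis.

    samples:   1-D array of accelerometer readings (one axis)
    frame_len: samples per analysis window
    hop:       stride between consecutive windows
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        # Magnitude spectrum of the windowed frame
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# Synthetic stand-in for a speech-induced vibration: a 200 Hz tone
# sampled at 2 kHz (a plausible ceiling for a phone accelerometer).
fs = 2000
t = np.arange(fs) / fs
trace = np.sin(2 * np.pi * 200 * t)
spec = accel_spectrogram(trace)
```

Each row of `spec` is one analysis window, and the 200 Hz tone shows up as a peak in the corresponding frequency bin — the kind of structure a downstream classifier would learn to associate with speech content.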

That is a very difficult problem, however, with no obvious way to make the translation. For this reason, the team turned to machine learning, training models on sample data to learn the associations between vibration data and speech. Specifically, convolutional neural network classifiers were designed to identify the speaker, determine their gender, and recognize the motion signatures of spoken digits.
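To make the idea concrete, here is a toy forward pass for such a classifier in plain NumPy: a 1-D convolution over spectrogram frames, ReLU, global average pooling, and a softmax over output classes. The layer sizes and random weights are purely illustrative — this is a sketch of the general shape of a CNN classifier, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Valid 1-D convolution along time: x (T, F), kernels (K, W, F) -> (T-W+1, K)."""
    K, W, F = kernels.shape
    T = x.shape[0]
    out = np.empty((T - W + 1, K))
    for t in range(T - W + 1):
        patch = x[t:t + W]  # (W, F) window of spectrogram frames
        out[t] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(spectrogram, kernels, dense_w):
    h = np.maximum(conv1d(spectrogram, kernels), 0.0)  # conv + ReLU
    pooled = h.mean(axis=0)                            # global average pool over time
    return softmax(pooled @ dense_w)                   # class probabilities

# Toy dimensions: 30 time frames x 65 frequency bins, 8 filters of width 5,
# and 10 output classes (e.g., spoken digits 0-9).
spec = rng.standard_normal((30, 65))
kernels = rng.standard_normal((8, 5, 65)) * 0.1
dense_w = rng.standard_normal((8, 10)) * 0.1
probs = classify(spec, kernels, dense_w)
```

In practice the weights would be learned by backpropagation from labeled recordings; the same architecture can serve all three tasks by changing the number of output classes (two for gender, one per enrolled speaker, ten for digits).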

EarSpy was evaluated in a series of trials in which a number of public audio datasets were played through the ear speaker of a modern smartphone. At the same time, a third-party app collected data from the accelerometer. The motion data was then analyzed with the researchers’ machine learning classification pipeline, which correctly determined the speaker’s gender in over 98% of cases. Moreover, EarSpy could recognize specific individuals correctly in better than 92% of cases. With respect to reconstructing speech, spoken digits were classified correctly about 56% of the time.
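Those figures are easier to judge against the chance baseline for each task — the accuracy a random guesser would achieve, which is one over the number of classes (assuming balanced classes). A quick stdlib-only calculation:

```python
# Reported accuracy vs. chance baseline for the two tasks whose
# class counts are known (accuracies as reported in the article).
tasks = {
    "gender (2 classes)": (0.98, 2),
    "spoken digit (10 classes)": (0.56, 10),
}
for name, (acc, n_classes) in tasks.items():
    chance = 1.0 / n_classes
    print(f"{name}: {acc:.0%} reported vs. {chance:.0%} chance ({acc / chance:.1f}x)")
```

The 56% digit accuracy is thus well above the 10% one would get by guessing — weak for transcription, but a meaningful information leak.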

EarSpy showed that this emerging attack vector deserves attention, as it has the potential to be used for malicious eavesdropping. However, because the volume of the new, larger speakers is greatly reduced when they are used as an ear speaker, the effectiveness of the approach is limited. The 56% average accuracy of digit detection would also be expected to drop substantially when expanded to a larger vocabulary. At present, reconstructed speech would likely be broken and carry a high degree of uncertainty. But this is still something to keep an eye on because, as is usually the case, technologies tend to improve greatly over time.
