Real-Time Translation for the Real World

AI-powered Spatial Speech Translation isolates voices in noisy spaces, making real-time multilingual conversations more practical than ever.

Nick Bild
This real-time translation system ignores background noise (📷: Paul G. Allen School)

One of the more promising applications of artificial intelligence is real-time language translation. A device that can translate a speaker’s words from one language to another has the potential to break down communication barriers and bring people together. These days, language translation is not an especially difficult problem to solve, and a number of real-time translation devices have been developed. Yet they are rarely encountered in the wild.

Why would such an important and useful tool be so sparingly used? A major reason is that, in practice, these devices require nearly ideal conditions to operate. Two speakers in an otherwise quiet room may have a very good experience, especially if they are careful to take turns speaking. But in a noisier environment full of background sounds and competing speakers (you know, like pretty much everywhere outside of your home), the devices cannot home in on one specific speaker’s voice. As a result, the transcriptions they produce are often a jumbled mess.

Engineers at the University of Washington have been hard at work trying to solve this problem and build a more practical real-time translation system. Their new approach, called Spatial Speech Translation, aims to succeed where other systems fall short. Unlike other tools that work best with a single speaker and minimal background noise, this new system is designed to work in real-world settings like museums, airports, or city streets.

The team’s solution combines off-the-shelf noise-canceling headphones and binaural microphones with powerful on-device AI models. These models identify and isolate multiple speakers around the user in real time, determine where each voice is coming from, and then translate the speech, all while preserving each voice’s unique qualities, such as its tone, along with its apparent direction. This helps listeners know who is speaking, what they are saying, and where the speaker is in relation to them.
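
The team’s actual models and code aren’t reproduced here, but the description above suggests a four-stage flow: separate the binaural mixture into individual voices, estimate each voice’s direction, translate each stream, and render the translated audio back at the original direction. The Python sketch below is a purely illustrative skeleton of that flow; every function is a hypothetical placeholder stub, not the UW team’s implementation.

```python
# Illustrative skeleton of a spatial speech translation pipeline.
# All function bodies are placeholder stubs; a real system would use
# trained neural models at each stage.
import numpy as np

def separate_speakers(left: np.ndarray, right: np.ndarray) -> list:
    """Placeholder: a source-separation model would split the binaural
    mixture into one mono stream per detected speaker."""
    return [(left + right) / 2]  # stub: treat the whole mix as one voice

def estimate_direction(left: np.ndarray, right: np.ndarray) -> float:
    """Placeholder: estimate direction from the interaural time difference,
    approximated here as the cross-correlation lag between the two ears."""
    lag = np.argmax(np.correlate(left, right, mode="full")) - (len(right) - 1)
    return float(lag)  # stub: raw lag in samples, not an angle

def translate(voice: np.ndarray) -> np.ndarray:
    """Placeholder: a speech-to-speech translation model that preserves
    the speaker's vocal characteristics."""
    return voice  # stub: pass-through

def render_spatial(voice: np.ndarray, direction: float):
    """Placeholder: pan the translated audio back toward the speaker's
    estimated direction so it still sounds like it comes from them."""
    return voice, voice  # stub: centered stereo

def process_frame(left: np.ndarray, right: np.ndarray) -> list:
    """Run one frame of binaural audio through the full pipeline."""
    outputs = []
    for voice in separate_speakers(left, right):
        direction = estimate_direction(left, right)
        outputs.append(render_spatial(translate(voice), direction))
    return outputs

if __name__ == "__main__":
    # Toy usage on one second of random "binaural" audio at 16 kHz
    rng = np.random.default_rng(0)
    left, right = rng.standard_normal(16000), rng.standard_normal(16000)
    print(len(process_frame(left, right)), "translated stream(s)")
```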

The prototype currently runs on Apple’s M2 chip, which is found in devices like laptops and the Apple Vision Pro headset. Processing everything on-device sidesteps the privacy concerns that would otherwise crop up when private conversations are sent to the cloud.

Testing in 10 real-world locations, both indoors and out, showed that the system could effectively separate and translate speech even in crowded environments. In a study involving 29 participants and over 350 minutes of real-world use, users preferred this spatial translation system over existing models, citing improved clarity, voice similarity, and spatial accuracy.

The system introduces only a brief delay of 2-4 seconds, which users found acceptable. Faster translations were possible but came with more errors, so the researchers are working to reduce the delay further without sacrificing accuracy. The current system translates Spanish, French, and German into English, but it is built on models that can be extended to more than 100 languages.
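
To make that tradeoff concrete: streaming translators typically buffer incoming audio into chunks and translate each chunk once it fills, so the chunk length sets a floor on the delay. The sketch below illustrates that general idea only; the chunking scheme and every name in it are assumptions for illustration, not details of the UW system.

```python
# Minimal sketch of the delay/context tradeoff in streaming translation.
# A longer CHUNK_SECONDS gives the model more context (fewer errors) at
# the cost of a longer wait before each translation is emitted.
CHUNK_SECONDS = 3.0   # assumed chunk length, in the 2-4 s range cited above
SAMPLE_RATE = 16000   # assumed audio sample rate

def stream_translate(blocks, translate_chunk):
    """Buffer incoming audio blocks and translate once a chunk fills.

    blocks: iterable of lists of samples (e.g., from a microphone callback)
    translate_chunk: placeholder for a real speech translation model
    """
    buffer = []
    target = int(CHUNK_SECONDS * SAMPLE_RATE)
    for block in blocks:
        buffer.extend(block)
        while len(buffer) >= target:
            chunk, buffer = buffer[:target], buffer[target:]
            yield translate_chunk(chunk)  # emitted ~CHUNK_SECONDS after spoken

if __name__ == "__main__":
    # Toy usage: ten one-second blocks of silence through an identity "model"
    silence = [[0.0] * SAMPLE_RATE for _ in range(10)]
    for i, out in enumerate(stream_translate(silence, lambda c: len(c))):
        print(f"chunk {i}: translated {out} samples")
```

Shrinking CHUNK_SECONDS in a scheme like this cuts the wait but gives the model less context per decision, which is one plausible reason faster translations came with more errors.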

With further development, Spatial Speech Translation could enable travelers, immigrants, and everyday users to understand and engage in conversations they otherwise could not, and without losing the subtle spatial and emotional cues that make communication feel human.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.