Putting Words in My Mouth

Microsoft's VALL-E text-to-speech synthesizer can convincingly mimic anyone's manner of speech given just a three-second audio sample.

Nick Bild
23 days ago · Machine Learning & AI
(📷: C. Wang et al.)

Text-to-speech technology has come a long way in recent years, with new developments making it more accurate and natural-sounding than ever before. One of the biggest advancements is the use of deep learning algorithms. These algorithms analyze large amounts of speech data and learn to mimic human speech patterns and nuances. This has led to a significant improvement in the quality of synthetic speech. However, successes have been limited because training these models requires huge amounts of high-quality, clean data recorded in a professional studio.

Common techniques, like crawling the internet to collect a large audio dataset, fall short in producing the type of high-quality data that is needed. The result of using such data is an algorithm with degraded performance. For this reason, most efforts to create synthetic speech have relied on a relatively small speech dataset. These small datasets cause the final product to sound less natural and to struggle to accurately reproduce the speech patterns of a particular speaker.

Researchers at Microsoft have recently announced a new text-to-speech synthesizer called VALL-E that was trained on a massive, high-quality audio dataset. The dataset was compiled by a team at Meta, and consists of over 60,000 hours of English language speech from upwards of 7,000 speakers. Leveraging this data, the Microsoft researchers were able to build and train a model that can generate natural-sounding speech. Leaning on the knowledge of thousands of speakers, they even showed it is possible to mimic the speech of a previously unseen speaker with as little as three seconds of sample audio.

Microsoft’s VALL-E neural codec language model builds on the EnCodec technology unveiled by Meta late last year, which is a neural network designed for high fidelity audio compression. Using that technology, VALL-E generates a set of audio codec codes from the input text and the audio sample of the speaker to mimic. The algorithm is then able, with the help of the training data, to match the speaker’s manner of speech with known samples and generate convincing simulated speech. The speaker's vocal timbre and emotional tone are preserved in the synthetic speech.
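To make the idea of "audio codec codes" a little more concrete, here is a minimal sketch of residual vector quantization, the general technique that neural codecs such as EnCodec use to turn a frame of audio features into a short stack of discrete codes. This is a toy illustration, not Meta's or Microsoft's implementation: the codebooks are random rather than learned, and the function names, codebook sizes, and the four-value "frame" are all assumptions made up for the example.

```python
import random


def make_codebooks(num_stages, codebook_size, dim, seed=0):
    """Build toy random codebooks (hypothetical; real codecs learn these from data)."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 1.0 / (2 ** stage)  # later stages cover finer residuals
        book = [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(codebook_size - 1)]
        book.append([0.0] * dim)  # zero entry: a stage can never make the error worse
        books.append(book)
    return books


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(frame, codebooks):
    """Return one code index per stage, each stage quantizing the previous residual."""
    residual = list(frame)
    codes = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected codewords to approximate the original frame."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


codebooks = make_codebooks(num_stages=8, codebook_size=16, dim=4)
frame = [0.3, -0.7, 0.1, 0.5]          # stand-in for one frame of audio features
codes = rvq_encode(frame, codebooks)   # a list of 8 small integers
approx = rvq_decode(codes, codebooks)  # reconstruction from the codes alone
print(codes, sq_dist(frame, approx))
```

In a system like the one described above, it is sequences of such integer codes, rather than raw waveforms, that a language model would be trained to predict from the text and the short voice sample.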

The VALL-E speech synthesis pipeline was put to the test in a series of experiments that compared it with the existing, state-of-the-art zero-shot text-to-speech system called YourTTS. Both models were provided with text to convert to speech, and the results were compared with a ground truth recording of the target speaker vocalizing the phrase. Human participants in the study were asked to compare each result with the ground truth recording to assess how similar they sounded. It was discovered that VALL-E significantly outperformed YourTTS with respect to speech naturalness and speaker similarity.

As of this writing, Microsoft has made the decision to not release their model publicly. Given that the system can convincingly reproduce anyone’s voice with just a short audio sample, it has a lot of potential for abuse. One could, for example, make a public figure or politician appear to have said something that they never did. To mitigate this risk, Microsoft indicated that they intend to build a model in the future that can detect if an audio sample was created with VALL-E. In the meantime, you can check out the impressive synthesized speech examples provided on GitHub.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.