Breaking the Language Barrier with Machine Learning

Google Research has used self-supervised learning with fine-tuning to create a speech recognition model that recognizes over 100 languages.

With the ability to understand natural language, speech recognition algorithms are transforming the way we interact with technology. From industry to the smart home, this technology has the potential to improve efficiency and reduce human error.

One of the most significant uses of machine learning speech recognition algorithms is in the healthcare industry. Doctors and other healthcare professionals are using this technology to transcribe patient information, allowing them to spend more time focusing on patient care. In addition, speech recognition technology is also being used in medical research to analyze and transcribe large amounts of data, enabling researchers to draw conclusions more quickly.

The retail industry is also seeing the benefits of speech recognition algorithms. In stores, this technology is being used to improve customer service by providing voice-activated assistants to answer customer queries. It is also being used in the supply chain, where it can help with inventory management and order fulfillment.

Overview of the Universal Speech Model training pipeline (📷: Google Research)

However, these benefits are only available when speech recognition models recognize an individual’s spoken language. And there lies one of the biggest challenges with this technology — a lack of support for the world's many spoken languages.

Late last year, Google Research launched what they call their 1,000 Languages Initiative, which aims to build a machine learning model capable of recognizing the world’s one thousand most-spoken languages. They have apparently been hard at work on this goal, and have just announced that they have made significant progress. They have developed the Universal Speech Model, which has two billion parameters and was trained on twelve million hours of recorded speech. It cannot yet recognize 1,000 languages, but has surpassed the 100 mark — an important milestone on the way to the ultimate goal.

Reaching the goal of building a massively multilingual model has been elusive in the past for a few primary reasons. Training these models traditionally relies on large volumes of supervised data, in which pre-existing transcriptions accompany speech samples. However, this requirement severely limits the amount of data that is available for training, and generating transcriptions manually is very costly and labor intensive.

Another challenge lies in updating models after they have been trained to improve their performance, and to expand the number of recognized languages, over time. Completely retraining a model is very computationally inefficient and can also be very costly. To overcome this challenge, a model must be capable of incorporating updates in an efficient manner.

Comparing the Universal Speech Model with other approaches (📷: Google Research)

The Google Research team’s training pipeline sidesteps these issues by taking an approach that uses self-supervised learning with fine-tuning. Self-supervision is an initial step in the analysis pipeline that learns to assign labels to speech samples. With automatically generated labels in place, the downstream analysis can then use traditional supervised learning methods. In most cases, some amount of labeled data is also available for training, so the team also added a second step to the pipeline that can leverage this information to improve the model’s performance — even by including a relatively small amount of labeled samples.

The final step of the pipeline allows for fine-tuning of the model. By including a small amount of supervised data, it is possible to optimize the algorithm for specific downstream tasks, like automatic speech translation, for example.

The researchers trained a model using their technique, then fine-tuned it using YouTube Caption’s multilingual speech data. The dataset contains 73 languages, and only about three thousand hours of data per language — quite a small amount in the world of deep learning. Nevertheless, the model achieved a word error rate of less than 30% across all languages. That is a 32.7% lower word error rate than other present state of the art models.

Based on these encouraging results, the Google Research team believes that the methods they have outlined in this work will serve as the basis for the successful completion of their 1,000 Languages Initiative. The believe that this effort will help to make the world’s information universally accessible.

machine learning

speech recognition

Nick Bild

R&D, creativity, and building the next big thing you never knew you wanted are my specialties.

Breaking the Language Barrier with Machine Learning

Google Research has used self-supervised learning with fine-tuning to create a speech recognition model that recognizes over 100 languages.

Latest articles

Sponsored articles

Related articles

Latest articles

Related articles