Whisper Is a Python-Based Robust Open Source Multilingual Speech Recognition Network

Trained on 680,000 hours of audio, Whisper offers everything from real-time speech recognition to multilingual translation.

3 years ago • AI & Machine Learning / Python on Hardware

OpenAI has released a multilingual open source neural network dubbed Whisper that, the company claims, "approaches human-level robustness and accuracy" for speech recognition tasks.

"Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web," OpenAI says of its latest neural network. "We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English."

OpenAI's Whisper is a fully open source highly-robust automatic speech recognition network, ready to download now. (📷: Radford et al)

Whisper uses an end-to-end encoder-decoder Transformer model. The audio to be recognised is split into 30-second chunks, each of which is converted to a log-Mel spectrogram and passed into an encoder; the decoder then predicts the corresponding text caption, with special tokens directing the same model to handle language identification, phrase-level timestamps, and translation as and where required.
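As a rough illustration of the chunking step described above, the sketch below splits a stream of audio samples into fixed 30-second windows and zero-pads the final one. The 16kHz sample rate and 10ms frame stride are assumptions taken from the team's paper rather than from this article, the function names are hypothetical, and the real project may advance its window differently; this is a simplification, not OpenAI's implementation.

```python
# Minimal sketch of fixed-length chunking, using only the standard library.
SAMPLE_RATE = 16_000   # assumed: samples per second, per the Whisper paper
CHUNK_SECONDS = 30     # from the article: audio is split into 30-second chunks
HOP_LENGTH = 160       # assumed: 10 ms stride between spectrogram frames


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split mono samples into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks


def frames_per_chunk(sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS,
                     hop_length=HOP_LENGTH):
    """Number of spectrogram frames produced for one chunk at the given stride."""
    return sample_rate * chunk_seconds // hop_length


audio = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence as stand-in audio
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]), frames_per_chunk())  # prints: 3 480000 3000
```

Under these assumptions, 70 seconds of audio yields three chunks, the last of which is mostly padding, and each chunk maps to 3,000 spectrogram frames for the encoder.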

Compared to its rivals, Whisper was trained on a larger and more diverse dataset. That, OpenAI says, means it makes 50 percent fewer errors than those models in zero-shot tests across diverse audio sources, though the company admits it cannot beat models which are specifically trained to excel at the LibriSpeech benchmark.

"About a third of Whisper’s audio dataset is non-English," the company adds, "and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA [State Of The Art] on CoVoST2 to English translation zero-shot."
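The choice between transcribing and translating is signalled to the decoder through special tokens at the start of its output. The sketch below builds such a prefix; the token spellings follow the format described in the team's paper and are an assumption as far as this article is concerned, and `decoder_prefix` is a hypothetical helper, not part of the released code.

```python
# Minimal sketch: assemble the special-token prefix that selects a task.
VALID_TASKS = ("transcribe", "translate")


def decoder_prefix(language, task="transcribe", timestamps=True):
    """Return the list of special tokens that would precede the predicted text.

    language: short language code such as "en" or "fr" (assumed format).
    task: "transcribe" keeps the original language, "translate" targets English.
    timestamps: when False, a token asking the model to skip timestamps is added.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


print("".join(decoder_prefix("fr", task="translate", timestamps=False)))
# prints: <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```

Because the task is just another token in the sequence, the same model weights can serve both jobs, which is what allows training to alternate between them.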

Whisper was trained on a 680,000-hour dataset including multiple languages and non-speech content. (📷: Radford et al)

To encourage use and further development of the network, OpenAI has released it to GitHub under the permissive MIT license. Training and testing took place on Python 3.9.9 with PyTorch 1.10.1, but the company says the code should be compatible with Python 3.7 and above alongside "recent PyTorch versions." The release includes five model sizes: Tiny, Base, Small, Medium, and Large. All but Large are also available as English-only models, and video RAM (VRAM) requirements range from 1GB to 12GB depending on the model chosen.
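For readers who want to try the released models from Python, the sketch below shows one way to pick a model and transcribe a file. It assumes the project's `whisper` package is installed and that English-only variants carry a `.en` suffix, as in the project's README; `checkpoint_name` and `transcribe_file` are hypothetical helpers written for this example, and the audio path is a placeholder.

```python
# Minimal sketch: map the article's model sizes to names and run a transcription.
SIZES = ("tiny", "base", "small", "medium", "large")


def checkpoint_name(size, english_only=False):
    """Return the model name for a size, e.g. "base" or "base.en"."""
    size = size.lower()
    if size not in SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only:
        # Per the article, every size except Large has an English-only variant.
        if size == "large":
            raise ValueError("the Large model has no English-only variant")
        return f"{size}.en"  # assumed naming, following the project's README
    return size


def transcribe_file(path, size="base", english_only=False):
    """Load a model and return the recognised text for one audio file."""
    import whisper  # third-party package from the project's GitHub repository

    model = whisper.load_model(checkpoint_name(size, english_only))
    result = model.transcribe(path)
    return result["text"]


print(checkpoint_name("Small", english_only=True))  # prints: small.en
```

Calling `transcribe_file("speech.mp3", size="tiny")` would download the chosen model on first use; smaller sizes trade accuracy for lower VRAM use and faster inference.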

More information is available in the project blog post, which includes a link to the team's paper; a demonstration is also available on Google's Colab platform.

Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.
