OpenAI has released a multilingual open source neural network dubbed Whisper that, the company claims, "approaches human-level robustness and accuracy" for speech recognition tasks.
"Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web," OpenAI says of its latest neural network. "We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English."
Whisper uses an end-to-end encoder-decoder Transformer model. The audio to be recognised is split into 30-second chunks, converted to a log-Mel spectrogram, then passed into an encoder. The decoder then predicts the corresponding text caption, intermixed with special tokens for language identification, phrase-level timestamps, and translation as and where required.
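The chunking step described above can be illustrated with a minimal, standard-library-only sketch. It is not OpenAI's code: the function name `split_into_chunks` is hypothetical, the 16kHz sample rate is an assumption rather than something stated here, and zero-padding the final chunk is one plausible way to keep every window the same length.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz (not stated in the article)
CHUNK_SECONDS = 30     # window length described in the article


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[List[float]]:
    """Split mono audio samples into fixed-length windows.

    The last window is zero-padded so every chunk has exactly
    sample_rate * chunk_seconds samples. Empty input yields no chunks.
    """
    chunk_len = sample_rate * chunk_seconds
    chunks: List[List[float]] = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk with silence up to the full window length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # 70 seconds of dummy audio should produce three 30-second windows.
    audio = [0.1] * (70 * SAMPLE_RATE)
    windows = split_into_chunks(audio)
    print(len(windows), len(windows[0]))
```

In a real pipeline each window would then be turned into a spectrogram before reaching the encoder; that step is left out here to keep the sketch dependency-free.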
Whisper was trained on a more expansive dataset than its rivals, something which OpenAI says gives it 50 percent fewer errors in zero-shot use across diverse audio sources, but which it admits means it cannot beat other models specifically trained to excel at the LibriSpeech benchmark.
"About a third of Whisper’s audio dataset is non-English," the company adds, "and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA [State Of The Art] on CoVoST2 to English translation zero-shot."
To encourage use and further development of the network, OpenAI has released it on GitHub under the permissive MIT license. Training and testing took place on Python 3.9.9 with PyTorch 1.10.1, but the company says the code should be compatible with Python 3.7 and above alongside "recent PyTorch versions." The release includes five model sizes: Tiny, Base, Small, Medium, and Large. Every size bar Large is also available as an English-only variant, and video RAM (VRAM) requirements range from around 1GB to around 10GB.
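For readers who want to try the release, the following is a hedged sketch of how picking a model and transcribing a file might look. The helper names `resolve_model_name` and `transcribe_file` are hypothetical. The `.en` suffix for English-only models and the `whisper.load_model` and `model.transcribe` calls follow the project's README as we understand it, so treat them as assumptions and check the repository before relying on them.

```python
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
ALL_SIZES = ENGLISH_ONLY_SIZES | {"large"}


def resolve_model_name(size: str, english_only: bool = False) -> str:
    """Map a model size to a model name (hypothetical helper).

    Assumes English-only variants use a ".en" suffix, and that Large
    has no English-only variant, as the article describes.
    """
    size = size.lower()
    if size not in ALL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only variant for {size!r}")
        return f"{size}.en"
    return size


def transcribe_file(path: str, size: str = "base",
                    english_only: bool = False,
                    translate: bool = False) -> str:
    """Transcribe an audio file, or translate it into English."""
    if english_only and translate:
        raise ValueError("translation needs a multilingual model")
    # Imported lazily so the helper above works without the package.
    # Assumes the `whisper` package from OpenAI's GitHub repository.
    import whisper

    model = whisper.load_model(resolve_model_name(size, english_only))
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    return result["text"]


if __name__ == "__main__":
    print(resolve_model_name("small", english_only=True))
```

Smaller models need less VRAM and run faster at some cost to accuracy, so starting with Tiny or Base and moving up only if the output falls short is a reasonable way to experiment.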