Canary Soars to New Levels of Accuracy

NVIDIA's Canary ASR model leads the pack in accuracy, supports multiple languages, and leverages the NeMo framework to simplify deployment.

Nick Bild
15 days ago · Machine Learning & AI
(📷: NVIDIA)

Human-computer interfaces are rapidly becoming more intuitive and efficient. One major reason for these improvements is the rise of automatic speech recognition (ASR). This technology enables computers and other devices to convert spoken language into written text by analyzing audio inputs and extracting linguistic information to understand and transcribe speech.

ASR systems are most commonly designed using machine learning models that are trained on large datasets of spoken language paired with corresponding text. These models learn patterns in speech such as phonemes, words, and phrases, and they use this knowledge to make predictions about what is being said.

ASR systems are popping up in more devices by the day. Virtual assistants like Siri, Alexa, and Google Assistant utilize ASR to let users interact with their devices using voice commands. The technology is also important in healthcare, where it is used to transcribe patient notes and medical records, improving efficiency and reducing the burden on overworked staff.

When choosing an ASR system, accuracy is crucial. High speech recognition accuracy ensures that the system can understand and transcribe spoken language reliably, leading to a better user experience. When ASR models have lower accuracy, the consequences can be significant. For most consumer electronics, this means frustrating experiences and seemingly inexplicable device behavior. But in certain fields, such as healthcare or law, inaccuracies can have much more serious implications, recording erroneous information that can lead to inappropriate decisions being made.

The next natural question is: what is the best ASR model available today? As with any technology, people have their favorites and disagreements abound. But what do the numbers say? With the recent release of NVIDIA's multilingual ASR model, named Canary, we appear to have a new leader of the pack. Canary presently sits atop the Hugging Face Open ASR Leaderboard with the lowest average word error rate (6.67 percent) of any tracked model.

Canary can transcribe speech in English, Spanish, German, and French. It is also capable of translation between English and the other three supported languages. These features were made possible by incorporating NVIDIA's efficient Fast-Conformer encoder and a custom concatenated tokenizer into the model architecture. The model was trained on 85,000 hours of public and in-house data, which, while it may sound like a lot, is far less than most competing models were trained on.

This new model has been released through NVIDIA’s NeMo framework, which simplifies the deployment of generative AI models, whether one is deploying them to the cloud or on-premises. Step-by-step instructions in the release announcement demonstrate how Canary can be used to transcribe audio with just a few lines of Python code.
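
To give a sense of what that looks like in practice, here is a minimal sketch of a transcription script. It assumes the canary-1b checkpoint is published on Hugging Face and exposed through NeMo's EncDecMultiTaskModel class, and that a local audio file named sample.wav exists; consult the release announcement and NeMo documentation for the exact, up-to-date API.

```python
# Minimal sketch: transcribing audio with Canary through the NeMo framework.
# Assumes the NeMo toolkit is installed and that the "nvidia/canary-1b"
# checkpoint name and EncDecMultiTaskModel class match the current release;
# verify against the official documentation before relying on this.
from nemo.collections.asr.models import EncDecMultiTaskModel

# Download and load the pretrained multilingual Canary model.
canary_model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# Transcribe one or more local audio files (hypothetical file name).
predicted_text = canary_model.transcribe(["sample.wav"], batch_size=1)

print(predicted_text)
```

Depending on the NeMo version, transcribe may return a list of plain strings or richer hypothesis objects, so it is worth inspecting the output rather than assuming a fixed format.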

Canary's license is quite permissive for research use, so it may be worth giving it a whirl if you are not entirely happy with your current ASR solution. It is a non-commercial license, however, so keep that in mind if you intend to build a product around it.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.