Parakeet-TDT Really Flies

NVIDIA's Parakeet-TDT ASR model offers high speed and accuracy, processing 10 minutes of audio in a second with a low error rate.

Nick Bild
14 days ago • Machine Learning & AI
Parakeet-TDT performs automatic speech recognition in real time (📷: NVIDIA)

For most of us, speech is the most intuitive way to communicate our thoughts and to give instructions to others, whether those others are humans or computers. In order to speak to computers, automatic speech recognition (ASR) technologies are needed to convert spoken language into written text. ASR uses various algorithms and machine learning techniques to process and understand audio inputs, translating them into comprehensible, structured text.

One of the major areas where ASR is making a significant impact is in personal assistants and other smart devices. Voice-activated virtual assistants like Siri, Google Assistant, and Amazon's Alexa rely heavily on ASR to understand and respond to user commands. This technology allows users to interact with their devices more naturally and efficiently, improving the overall user experience.

ASR also plays a crucial role in improving accessibility for individuals with disabilities. For instance, people with visual impairments can use ASR to interact with digital content more easily, while those with motor impairments can use voice commands to control devices and applications. ASR can also help bridge language barriers by providing real-time translation services, enabling communication between people who speak different languages.

We recently reported on the release of NVIDIA’s Canary ASR model and its state-of-the-art accuracy, a crucial factor in creating a good user experience. But for some applications, speed is just as important as accuracy. In edge computing applications in particular, lightweight models are highly desirable because they can run fast even on resource-constrained hardware platforms. It is for these use cases that NVIDIA’s Parakeet-TDT ASR model was designed.

Parakeet-TDT transcribes audio quickly, processing 10 minutes of audio in about a second. This is an improvement over previous models, like Parakeet RNNT 1.1B. The Hugging Face Open ASR Leaderboard shows Parakeet-TDT to be not only 64 percent faster than that earlier model according to nine metrics, but also more accurate. These are very important factors for real-time ASR solutions running on minimal hardware.
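To put the speed figure in perspective, throughput for ASR models is often expressed as an inverse real-time factor (sometimes written RTFx): seconds of audio transcribed per second of compute. The short sketch below works through that arithmetic for the "10 minutes in about a second" figure. The function name `rtfx` is hypothetical, and reading "64 percent faster" as 1.64 times the throughput is an assumption made here for illustration, not something the leaderboard figure is confirmed to mean.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# The article's headline figure: about 10 minutes of audio in about one second.
parakeet_tdt_rtfx = rtfx(audio_seconds=10 * 60, processing_seconds=1.0)

# ASSUMPTION: "64 percent faster" is read as 1.64x the older model's throughput.
# Under that reading, the older model would need about 1.64 s for the same clip.
implied_baseline_seconds = 1.0 * 1.64
implied_baseline_rtfx = rtfx(10 * 60, implied_baseline_seconds)

print(f"Parakeet-TDT: about {parakeet_tdt_rtfx:.0f}x real time")
print(f"Implied older model: about {implied_baseline_rtfx:.0f}x real time")
```

In other words, the quoted figure corresponds to roughly 600 times faster than real time, which is why the model leaves so much headroom on constrained hardware.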

The performance of this new model results largely from NVIDIA’s novel Token-and-Duration Transducer architecture, which enables greater accuracy and faster inference than conventional Transducers of similar size. In the case of Parakeet-TDT, this approach resulted in a word error rate of under 7.0 percent.
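For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. The sketch below is a minimal, generic implementation using word-level edit distance. It is not the leaderboard's scoring code (which also normalizes text before comparing), and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one substituted word out of nine reference words.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1f}%")  # prints "WER: 11.1%"
```

A word error rate under 7.0 percent therefore means fewer than seven such word-level errors per hundred reference words.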

Parakeet-TDT can be deployed to large cloud-based systems or edge computing platforms without much difficulty using NVIDIA’s NeMo framework. NeMo can be installed with a single pip command, and an audio sample can then be transcribed with just three lines of Python code. For step-by-step instructions, take a look at the release announcement on NVIDIA’s Technical Blog.
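As a rough idea of what that looks like, here is a minimal sketch based on NeMo's pretrained-model interface. The model identifier `nvidia/parakeet-tdt-1.1b` and the audio filename are assumptions, since the article does not name a specific checkpoint or file. The wrapper function `transcribe` is a hypothetical convenience, and the exact shape of the returned transcripts can differ between NeMo versions.

```python
# Install NeMo's ASR components first (a shell command, not Python):
#   pip install nemo_toolkit['asr']

# ASSUMPTION: this checkpoint name is not given in the article.
MODEL_NAME = "nvidia/parakeet-tdt-1.1b"


def transcribe(audio_paths, model_name=MODEL_NAME):
    """Load a pretrained NeMo ASR model and transcribe the given audio files."""
    # Imported lazily so this file can be loaded on machines without NeMo.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then reuses the cached copy.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

    # Returns one result per input file; the result type varies by NeMo version.
    return asr_model.transcribe(list(audio_paths))


# Example call with a placeholder filename:
#   transcripts = transcribe(["some_audio_file.wav"])
```

The import, the `from_pretrained` call, and the `transcribe` call are the three lines that do the actual work; the wrapper only keeps them together.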

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.