If you have at least tinkered with training a machine learning model, then you probably know what it is like to achieve a high accuracy rate, only to see the model's performance plummet when it is run against a different dataset. Training models that generalize well can be very challenging, and without good datasets to test against, you may not even notice that the model has overfit to domain-specific characteristics. This sort of problem can lead to you yelling at your smart speaker because it turned on your bedroom lights when you clearly asked it to turn on the living room lights... again.
Apparently having had just about enough shenanigans from their own smart speakers, a group of researchers from University College London, Nokia Bell Labs Cambridge, and the University of Oxford have created a new Automatic Speech Recognition (ASR) dataset to help deal with some common issues. The dataset, Libri-Adapt, contains 7,200 hours of English speech. It seeks to deal with three particularly challenging scenarios encountered by ASR models: differing acoustic environments, variation in speaker accent, and heterogeneity of microphone hardware.
Libri-Adapt is based upon the Librispeech-clean-100 dataset. The samples from this dataset are recorded on six different microphones, with three different accents (US English, British English, Indian English), and with added synthetic background noises (rain, wind, laughter). Microphones tested include those on the MATRIX Voice and Seeed ReSpeaker development boards popular with hardware hackers.
To assess how well the Libri-Adapt dataset exposes problems that ASR models have under varying conditions, the team used the Mozilla DeepSpeech2 model as a base and further trained it with Libri-Adapt. For each test, data for one domain was held out of the training data, and the model’s Word Error Rate (WER) was then checked against the held-out domain. The WER was found to be 11.39% when testing against a microphone that the model was trained on, but when swapping microphones, the WER more than doubled to 24.18%. Similar results were observed when testing for shifts in speaker accent and background noise.
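For readers unfamiliar with the metric, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition; it is not the evaluation code the researchers used, and the function name `word_error_rate` and the example sentences are made up for illustration. It also assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: splits on whitespace and does no normalization
    (casing, punctuation), which real evaluations usually apply first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution plus one deletion
# against a six-word reference.
wer = word_error_rate(
    "turn on the living room lights",
    "turn on the bedroom lights",
)
print(f"WER: {wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words.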
These findings highlight the problems ASR models face when training data is not sufficiently diverse. Libri-Adapt holds promise for helping to remedy these issues and alleviate user frustration with flaky voice assistants. The dataset is open source and available on GitHub.