Google Releases a Dataset to Train Your AI on Natural Language

Google is an interesting entity in the tech world. We love what they give us, but we’re well aware these days of how our information is used for their financial gain. Even so, there is no denying two truths: they have the economic and technological heft to create innovative products, and they tend to make those advancements available to us common peasants. The most recent release to stir up moral dilemmas for us hackers is a dataset for training speech recognition models on their TensorFlow platform.

The dataset, essentially a textbook to get an artificial intelligence up to speed on how to communicate with humans, can technically be used for any AI application. It’s built on Flask, an open source Python framework, so it can be used on any compatible platform. Realistically speaking, though, you’ll probably want to use it with Google’s TensorFlow system.

TensorFlow is, according to its website, an “open source software library for numerical computation using data flow graphs.” It lets users take advantage of deep learning with neural networks to improve results in whatever industry they operate in. Don’t misunderstand us: this kind of narrow AI usage is widespread, and in general it’s a good thing for consumers. Narrow AI like this helps the computers we work with better understand what we’re asking of them.

If this dataset is open source and can be used to train any AI, what exactly is its significance? There are plenty of natural-language training resources out there. The answer lies in the actual speech (or text) patterns themselves. To teach an AI, especially a narrow AI like this, how to talk to us, we have to give it something to learn from. Natural Language Processing (a huge part of AI research) relies on feeding the AI examples of natural language (as the name suggests). The problem is that most text-based datasets we have easy access to aren’t “natural language.”

Traditionally, these datasets have come from open-source references like books that are no longer under copyright. The problem, of course, is that such text doesn’t reflect how a normal person actually talks today. For this dataset, Google took a mere 30 words, but had them recorded in a massive 65,000 utterances in order to cover the full gamut of human speech in a single language.
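
To put those numbers in perspective, 65,000 utterances spread over 30 words works out to roughly 2,000 recordings per word on average. As a rough sketch of how you might take stock of a dataset like this before training, the Python below assumes the recordings are unpacked into one folder per word with WAV clips inside. That layout, and the function names index_utterances, clip_seconds, and summarize, are our own assumptions for illustration, not something the release is confirmed here to use.

```python
import wave
from collections import Counter
from pathlib import Path


def index_utterances(root):
    """Return (path, label) pairs for every WAV clip under root/<label>/.

    Assumes a hypothetical layout where each spoken word has its own
    folder and the folder name doubles as the training label.
    """
    pairs = []
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        pairs.append((wav_path, wav_path.parent.name))
    return pairs


def clip_seconds(wav_path):
    """Read one clip's duration in seconds from its WAV header."""
    with wave.open(str(wav_path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def summarize(root):
    """Count utterances per word so uneven coverage is easy to spot."""
    return Counter(label for _, label in index_utterances(root))
```

Calling summarize on the unpacked folder returns a count of clips per word, and the (path, label) pairs from index_utterances are the kind of thing you would hand to whatever audio loading step your training code uses, TensorFlow or otherwise.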

This dataset isn’t unprecedented. It’s the same kind of material AI researchers already use to train their neural networks. The key here is that it’s open source and ready to plug into your TensorFlow (or other) AI. We’re looking forward to seeing how AI researchers, both hobbyist and professional, can use it to expand their creations’ understanding of natural language.
