Many major advancements in machine learning algorithm design have fueled a revolution in the field over the past decade. As a result, we now have models that are so impressive that the elusive goal of developing an artificial general intelligence seems like it may become more than just science fiction in the near future. But to continue forward progress, some of the attention that has been focused on designing better models needs to be redirected to creating higher-quality datasets.
High-quality data is critical for building accurate machine learning models. A model's effectiveness depends largely on both the quality and the quantity of the data used to train it. Machine learning algorithms rely heavily on patterns and trends in data, and in general, the more data available for training, the better the model performs. However, simply having large amounts of data is not enough; the data must also be of high quality, meaning that it is accurate, relevant, and reliable.
Algorithm design may seem like the more interesting part of the process, with dataset generation just being a necessary evil. But consider the phrase "data is the new code" that is being heard with increasing frequency among AI researchers. From this perspective, the model serves only to determine the maximum possible quality of a solution. Without data to "program" it, it is not of much use. It is only through good, appropriate dataset selection that a model can learn relevant patterns that accurately encode the underlying information.
Towards the goal of improving datasets, and the methods involved in creating them, members of Google Research have collaborated on a project called DataPerf. Through gamification and standardized benchmarking, DataPerf seeks to encourage advancements in data selection, preparation, and acquisition technologies. Several challenges have also been launched to help drive innovation in a few key areas.
Some of the areas the team would like to see addressed first are dataset selection and dataset cleaning. With many data sources to choose from for any given problem, the question becomes: which of them should be used to build an optimally trained model for a particular use case? Data cleaning is equally critical, because it is a well-known problem that even very popular datasets contain errors, such as mislabeled samples. These issues wreak havoc when training an algorithm, yet with such large volumes of data, they cannot be uncovered manually. For this reason, automated methods are needed to detect the samples that are most likely to be mislabeled.
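One common automated approach, sketched below purely for illustration (the toy one-dimensional features, function name, and k-nearest-neighbors heuristic are assumptions of this sketch, not anything from DataPerf), is to flag a sample as a likely labeling error when its label disagrees with the majority label of its nearest neighbors:

```python
# Hedged sketch: flag likely-mislabeled samples by checking whether each
# sample's label disagrees with the majority label of its k nearest
# neighbors. All names and data here are illustrative, not DataPerf APIs.

def knn_label_disagreement(points, labels, k=3):
    """Return indices of samples whose label conflicts with their neighbors."""
    suspects = []
    for i, (x, y) in enumerate(zip(points, labels)):
        # Distances to every other sample (1-D toy features here).
        neighbors = sorted(
            (abs(x - points[j]), labels[j])
            for j in range(len(points)) if j != i
        )[:k]
        neighbor_labels = [lab for _, lab in neighbors]
        majority = max(set(neighbor_labels), key=neighbor_labels.count)
        if majority != y:
            suspects.append(i)
    return suspects

# Toy data: two well-separated clusters, with one deliberately flipped label.
points = [0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3]
labels = ["a", "a", "b", "a", "b", "b", "b", "b"]  # index 2 looks mislabeled

print(knn_label_disagreement(points, labels))  # → [2]
```

Real systems apply the same idea in a learned feature space, or use a model's own cross-validated predictions in place of neighbor votes, but the principle is the same: samples that disagree with everything around them deserve a second look.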
A related question is: how do we determine the quality of a dataset in the first place? As we implement new methods, how can we be sure that we are moving in the right direction, and by how much? Such an assessment tool will need to be developed to evaluate new techniques, but as the team pointed out, it could also be valuable for another reason. High-quality data is becoming a sought-after product in many industries, so if you can prove that your dataset is among the best, it may command a higher price.
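One common way to operationalize dataset quality is to train a fixed, simple model on each candidate training set and measure its accuracy on a shared, trusted test set. The sketch below is an assumption-laden toy version of that idea (a 1-nearest-neighbor classifier on made-up one-dimensional data, with invented function names), not the scoring procedure DataPerf itself uses:

```python
# Hedged sketch: score a candidate training set by the accuracy of a fixed,
# simple model trained on it and evaluated on a shared, trusted test set.
# The 1-nearest-neighbor classifier and toy 1-D data are illustrative only.

def nearest_neighbor_predict(train, x):
    """Predict the label of the training sample closest to x."""
    return min(train, key=lambda point: abs(x - point[0]))[1]

def dataset_score(train, test):
    """Fraction of trusted test samples the train-set-induced model gets right."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

test_set = [(0.0, "a"), (0.3, "a"), (4.7, "b"), (5.0, "b")]  # trusted labels
clean = [(0.1, "a"), (0.2, "a"), (4.8, "b"), (4.9, "b")]
noisy = [(0.1, "a"), (0.2, "b"), (4.8, "b"), (4.9, "a")]  # two labels flipped

print(dataset_score(clean, test_set))  # → 1.0
print(dataset_score(noisy, test_set))  # → 0.5
```

Because the model is held fixed, any difference in score is attributable to the data itself, which is exactly the property a dataset benchmark needs.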
The current DataPerf challenges span the computer vision, speech, and natural language processing domains. They focus on data selection, data cleaning, and dataset evaluation. The first round of these challenges closes on May 26th, 2023, so be sure to get started on your entry right away if you have an interest in optimizing machine learning through better data.