Picture Perfect
This suite of tools can reduce common sources of errors in machine learning datasets and help to produce more accurate models.
As machine learning algorithms continue to advance, the need for good, accurately annotated datasets is becoming increasingly apparent. With less and less room for optimization of the models themselves, more attention is finally being turned to addressing issues with data quality. After all, no matter how much potential a particular model has, that potential cannot be realized without a good dataset to learn from.
Image classification is a common task for machine learning models, and these models suffer from a particular type of data problem called co-occurrence bias. Co-occurrence bias can cause irrelevant details to get the attention of a machine learning model, leading to incorrect predictions. For example, if a dataset used to train an object recognition model only contains images of boats in the ocean, the model may start classifying anything related to the ocean, such as beaches or waves, as boats. This happens because the model has learned that boats frequently occur in images with oceans, even though the relationship between the two is not necessarily causal.
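To make the idea concrete, a simple diagnostic (not part of the researchers' tool, just an illustration) is to measure how often other labels appear alongside a target label. The dataset, label names, and function below are all hypothetical; a rate near 1.0 means the target almost never appears without that context, which is exactly the situation that invites co-occurrence bias.

```python
from collections import Counter

def cooccurrence_rates(image_labels, target):
    """For each label, return the fraction of images containing `target`
    that also contain that label. Rates near 1.0 flag contexts a model
    may learn to rely on instead of the target object itself."""
    with_target = [labels for labels in image_labels if target in labels]
    counts = Counter(l for labels in with_target for l in labels if l != target)
    return {l: c / len(with_target) for l, c in counts.items()}

# Hypothetical per-image label sets: every "boat" image also shows "ocean".
dataset = [
    {"boat", "ocean"}, {"boat", "ocean", "sky"}, {"boat", "ocean"},
    {"beach", "ocean"}, {"car", "road"},
]
rates = cooccurrence_rates(dataset, "boat")
# rates["ocean"] == 1.0 -- a red flag for co-occurrence bias
```

Here "ocean" co-occurs with "boat" in 100% of boat images, so a model trained on this data has no way to learn that oceans alone are not boats.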
A common method of dealing with this type of bias involves adding metadata to the images that focuses the model's attention on only the relevant regions. This is typically achieved by having humans manually provide pixel-level annotations that define the objects of interest in each image. For a large dataset, this is clearly a time-consuming and costly task. And for very large datasets, it can be altogether impractical.
A team led by researchers at the Japan Advanced Institute of Science and Technology has developed a new technique that may take a good deal of the pain out of annotating large image datasets for machine learning. They built an interactive tool that allows human annotators to identify relevant regions in images with just a click. They have also proposed an active learning strategy that reduces the number of annotations needed by selecting only the most informative samples for labeling.
Working under the assumption that the most informative images are those that best match the common characteristics of the entire dataset, the researchers chose a Gaussian mixture model to select images for annotation. With the top-scoring images identified, the next step is to use a custom interactive application that was designed to make annotation as fast and simple as possible. A user need only left-click on regions of interest and, if necessary, right-click on areas that the model should ignore.
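The selection step can be sketched in a few lines. This is a minimal illustration of the general idea rather than the team's actual implementation: fit a Gaussian mixture model to image feature vectors (however those are extracted), then rank images by their likelihood under the fitted model, so the ones that best match the dataset's common characteristics come out on top. The feature data here is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_informative(features, n_select, n_components=1, seed=0):
    """Fit a GMM to per-image feature vectors and return the indices of
    the n_select images with the highest likelihood under the model,
    i.e. those most typical of the dataset as a whole."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(features)
    scores = gmm.score_samples(features)  # log-likelihood per image
    return np.argsort(scores)[::-1][:n_select]

# Toy data: 200 "typical" feature vectors plus 5 far-away outliers.
rng = np.random.default_rng(0)
typical = rng.normal(0.0, 1.0, size=(200, 8))
outliers = rng.normal(6.0, 1.0, size=(5, 8))
features = np.vstack([typical, outliers])

picked = select_informative(features, n_select=10)
# The outliers (indices 200-204) score poorly, so all top picks come
# from the typical cluster.
```

In practice the features would come from a pretrained image encoder, and only the top-ranked images would be handed to the human annotator.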
A study of 16 participants was conducted to assess how well this system achieved its stated goals. Several other leading tools for annotating images and selecting the most relevant images from a dataset were also evaluated for comparison with the new strategy. The annotation tool proved quite successful, cutting annotation time by 27%. It was also popular with users, with 81% preferring it over the other options.
However, it was noted that in spite of all of the efficiency gains, users still get distracted and lose focus when annotating thousands of images. This is a real concern, because the whole point of the exercise is to create a clean dataset, and inattentive annotators are far more likely to make mistakes. The researchers hope to improve the situation by adding new features, like a progress gauge and occasional prompts suggesting the user take a break. With a bit of refinement, this tool may help in creating the next generation of machine learning models.