You've Got Some Explaining to Do

Researchers evaluated techniques that tell us how well ML models zero in on relevant bits of data, and the results are not encouraging.

Nick Bild
Machine Learning & AI

When training a machine learning model to recognize the difference between a cat and a dog, we may be quite satisfied that the model performs well simply by taking its validation accuracy at face value. But when the stakes are higher, as in discerning between tumors and normal tissue, a greater level of assurance is needed. In such cases, researchers often turn to feature attribution methods, which are designed to identify the parts of an image that matter most to the model's predictions.

In theory, these feature attribution methods will reveal whether a tumor classifier has homed in on features of the abnormal tissue, or has instead latched onto irrelevant features like watermarks, borders, or digital ruler marks. But this raises an important question: since we do not know which features matter a priori, how do we know whether a feature attribution method is missing features that are actually important to the model?

A research team at MIT CSAIL and Microsoft Research recently devised a technique for checking whether an attribution method really points to the features a model depends on. They did this by modifying the original dataset that a model was trained on, and then using the modified dataset to evaluate three families of feature attribution methods: saliency maps, rationales, and attention.
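To make the first of those families concrete, here is a minimal sketch of a plain gradient saliency map in PyTorch. This is illustrative only; the tiny model and the input size are placeholders, not the setup used in the study.

```python
# Minimal gradient saliency sketch: how sensitive is the class score to each pixel?
import torch
import torch.nn as nn

# Placeholder two-class image classifier (assumed architecture, not the paper's).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)
model.eval()

def saliency_map(image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Per-pixel importance: |d score(target_class) / d pixel|."""
    image = image.clone().requires_grad_(True)        # track gradients w.r.t. pixels
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()                                  # gradient of the class score
    # Collapse channels; larger magnitude = pixel the prediction is more sensitive to.
    return image.grad.abs().max(dim=0).values

example = torch.rand(3, 64, 64)                       # stand-in "image"
heatmap = saliency_map(example, target_class=0)       # shape (64, 64)
```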

The dataset is modified to intentionally weaken any relationship between the original images and their true classes, ensuring that the original features are no longer important. New features are then added to each image. These features are designed to be very obvious, such as large rectangles with a different color for each class, so that the model must focus on them to make its predictions. A model trained on data modified in this way is then used to test the feature attribution methods. Under these conditions, you would expect all of the methods to highlight these obvious regions, rather than regions that make no meaningful contribution to the classifications.
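As a rough illustration of that manipulation, the sketch below randomizes the labels to break the original image/label relationship and then stamps a large class-colored rectangle on each image. The image sizes, color palette, and the modify_dataset helper are my own assumptions, not the paper's exact recipe.

```python
# Build a dataset where the only reliable signal is an inserted rectangle.
import numpy as np

CLASS_COLORS = {0: (255, 0, 0), 1: (0, 0, 255)}  # assumed two-class palette

def modify_dataset(images: np.ndarray, rng: np.random.Generator,
                   box: int = 64) -> tuple[np.ndarray, np.ndarray]:
    """images: (N, H, W, 3) uint8 array. Returns (new_images, new_labels)."""
    n = len(images)
    # 1) Random labels: the original image content no longer predicts the class.
    new_labels = rng.integers(0, len(CLASS_COLORS), size=n)
    new_images = images.copy()
    # 2) Stamp an obvious patch whose color encodes the new label.
    for img, label in zip(new_images, new_labels):
        img[:box, :box] = CLASS_COLORS[int(label)]   # top-left corner rectangle
    return new_images, new_labels

# A model trained on (new_images, new_labels) can only succeed by looking at the
# rectangle, so a faithful attribution method should highlight that patch.
```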

The results of the team's investigations were very discouraging. Given what should be a very simple task, one would expect nearly all of the highlighted features to fall within or very near the manipulated region. Instead, none of the methods performed well: most struggled to reach even the 50% mark that would be expected if regions were chosen at random, and some performed worse than that random baseline.
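To make that 50% figure concrete, here is a hedged sketch of the kind of score being discussed: the share of the top-attributed pixels that land inside the inserted rectangle. The exact metric in the paper may differ; the hit_rate helper, the image size, and a rectangle covering half the image are assumptions chosen so that random attributions score about 0.5.

```python
# Fraction of the top-k attributed pixels that fall inside the inserted region.
import numpy as np

def hit_rate(attribution: np.ndarray, region_mask: np.ndarray, k: int) -> float:
    """attribution: (H, W) importance scores; region_mask: (H, W) bool mask of
    the inserted rectangle. Returns the fraction of the k highest-scoring
    pixels that land inside the rectangle."""
    flat = attribution.ravel()
    top_k = np.argpartition(flat, -k)[-k:]           # indices of the k largest scores
    return float(region_mask.ravel()[top_k].mean())

# Random baseline: if attributions were noise, the expected hit rate equals
# the fraction of the image the rectangle occupies.
H, W = 128, 128
mask = np.zeros((H, W), dtype=bool)
mask[:, :W // 2] = True                              # rectangle covers half the image
random_attr = np.random.default_rng(0).random((H, W))
print(hit_rate(random_attr, mask, k=1000))           # ~0.5, the random baseline
```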

The finding that widely used feature attribution methods do a poor job of locating truly important features, and frequently highlight unimportant ones, suggests that our trust in the current state of the art for feature attribution has been misplaced. What we have been offered may be little more than a false sense of confidence in our models. The researchers suggest that their approach be adopted in developing and evaluating future feature attribution methods, which sounds like a very reasonable way to renew our trust in these tools and ensure that our models are actually doing the jobs we designed them for.
