Let's Get This Straight
Towards developing more human-like computer vision, MIT researchers have evaluated dozens of models to understand which perform the best.
With modern advances in machine learning algorithms and processing units, computer vision systems have made significant progress in their ability to recognize and track objects in motion. However, there are still key differences between human visual perception and computer vision that limit the latter's ability to understand the world in a stable and predictable way.
Humans have the remarkable ability to quickly recognize the key features of an object and create an internal, generalized representation of it. That generalized representation is key in our ability to perceptually straighten visual information over time. As such, we can effortlessly track an object over time, and predict what its future trajectory will be.
Machines, on the other hand, work very differently and rarely exhibit any properties resembling perceptual straightening. Machine learning algorithms must learn to recognize objects from the information contained in a high-density grid of pixels. This pixel data can vary wildly between frames of a video, even when looking at the same object, due to differing angles of rotation or lighting, for example. And all those differences make it challenging for an algorithm to develop a consistent, stable representation of that object over time.
Recognizing the tremendous value that perceptual straightening affords us in understanding the world, a team of MIT researchers began working to better understand why computer vision systems lack this capability, and how they might learn to see the world in a more human-like way. Towards this goal, they evaluated how different model architectures, training schemes, and purposes relate to perceptual straightening in time sequence images.
First, they defined a method to measure the representational straightness between consecutive frames of a video sequence. This formula calculates a curvature, which is defined as the angle between the vectors representing the difference between a pair of image frames. The curvature metric can be calculated for either the pixel-level input data, or at any other layer of a neural network, to understand the representations at each level.
Dozens of neural network architectures, trained on a variety of different datasets, were evaluated by calculating curvature at the input and output level. This would reveal which types of models and other conditions reduce curvature (and therefore, perceptually straighten) visual information the most.
Of all the factors tested, it was found that adversarial training methods led to the most significant reduction in differences between representations over time. During the adversarial training process, small changes are made to pixels. These changes, while subtle to a human, are enough to fool many machine learning algorithms. But by intentionally introducing them, the resulting models tend to be more robust against small differences. And that tolerance for differences allows the models to create more general representations, and therefore, have greater perceptual straightening capabilities.
Another key finding is that the purpose of the model is also important. In particular, they found that algorithms designed to classify entire images had lower curvature than those that were designed to localize specific objects in an image. This again has an explanation in generality β models seeking to detect individual objects must have a finer-grained representation of each object, whereas classifying an entire image successfully requires that the model be resilient to a good deal of background noise.
Aside from these few findings, the vast majority of models and other parameters tested did a very poor job of reducing curvature over time. The team hopes that future work building on their findings will help us to better understand what elements we can include in our models to achieve computer vision that is much more human-like. They envision such systems being used in future self-driving cars, to help them predict the future path of pedestrians, cyclists, and other vehicles.