Hey! Eyes Over Here, Buddy!
A novel training method gives vision transformers the ability to understand a visual scene in an efficient and human-like way.
For us humans, picking out the most important visual features in a scene just comes naturally. If someone is standing in front of us talking, we direct our gaze at them, not at the trees in the background. But for machines, nothing comes naturally. When their cameras snap a picture, all they “know” is that there are millions of individual colored pixels to examine. Computationally exploring all of those pixels, at different scales, is a very inefficient way to find the important elements of a scene, so better methods are needed.
In recent years, methods such as saliency models, convolutional neural networks, and vision transformers (ViTs) have emerged. These approaches have shown some promise, yet in one way or another they fail to emulate human-like visual attention. Recently, however, a trio of researchers at The University of Osaka had an idea that could change this: ViTs may be capable of learning human-like patterns of visual attention, but only if they are trained in just the right way.
The researchers discovered that when ViTs are trained with a self-supervised technique known as DINO (short for self-distillation with no labels), they can spontaneously develop attention patterns that closely mimic human gaze behavior. Unlike conventional training approaches, which rely on labeled datasets to teach a model where to look, DINO lets the model learn by organizing raw visual data on its own, without human guidance.
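To get a feel for what these attention patterns look like, here is a minimal sketch of pulling per-head attention maps out of a publicly released DINO-pretrained ViT, using the torch.hub entry point from the official facebookresearch/dino repository. The image file name is a placeholder, and this is illustrative code, not the authors' analysis pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a DINO-pretrained ViT-S/16 from the official repo via torch.hub
# (requires internet access on first run).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

# Standard ImageNet-style preprocessing; 224x224 input gives a 14x14 patch grid.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# "scene.jpg" is a placeholder for any image or video frame.
img = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # Self-attention of the final transformer block,
    # shape (1, num_heads, num_tokens, num_tokens).
    attn = model.get_last_selfattention(img)

# Attention from the [CLS] token to the image patches, one 14x14 map per head.
# Per-head maps like these are the kind of thing compared with human gaze data.
num_heads = attn.shape[1]
cls_attn = attn[0, :, 0, 1:].reshape(num_heads, 14, 14)
print(cls_attn.shape)  # torch.Size([6, 14, 14]) for ViT-S/16
```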
To test this idea, the team compared human eye-tracking data with the attention patterns generated by ViTs trained in two ways: with conventional supervised learning and with DINO. The DINO-trained models not only focused more coherently on the relevant parts of a visual scene, but also mirrored the way people look at videos.
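The article does not spell out the exact similarity metric, but comparisons of this kind are commonly scored by correlating a model's attention map with a fixation-density map built from eye-tracking data. The sketch below uses Pearson correlation as one such generic measure; the function name and the metric choice are illustrative assumptions, not the study's actual procedure.

```python
import numpy as np

def attention_gaze_correlation(attn_map, gaze_map):
    """Pearson correlation between a model attention map and a human
    fixation-density map of the same spatial size. This is one common
    similarity score in gaze-prediction work; the study's own metric
    may differ."""
    a = attn_map.ravel().astype(np.float64)
    g = gaze_map.ravel().astype(np.float64)
    a = (a - a.mean()) / (a.std() + 1e-8)
    g = (g - g.mean()) / (g.std() + 1e-8)
    return float(np.mean(a * g))

# Hypothetical usage: a per-head map such as 'cls_attn' from the earlier
# sketch, upsampled to the resolution of the eye-tracking heatmap, would be
# passed in alongside that heatmap.
```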
This behavior was especially noticeable in scenes involving human figures. Some of the model's attention heads consistently focused on faces, others on whole human bodies, and still others on the background, mirroring how the human visual system separates figures from their surroundings. The researchers labeled these three attention clusters G1 (eyes and keypoints), G2 (entire figures), and G3 (background), noting a strong resemblance to the way people naturally segment visual scenes.
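How attention heads end up sorted into groups like G1, G2, and G3 can be illustrated with ordinary clustering over each head's average attention map. The sketch below uses k-means from scikit-learn as a stand-in; the researchers' own grouping procedure may differ, and `head_maps` is a hypothetical array of per-head maps averaged across video frames.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_attention_heads(head_maps, n_groups=3, seed=0):
    """Cluster attention heads by the similarity of their average spatial
    attention patterns. 'head_maps' is assumed to have shape
    (num_heads, height * width); returns one group label per head."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    return km.fit_predict(np.asarray(head_maps, dtype=np.float64))
```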
Traditional approaches such as saliency models and deep-learning gaze predictors often fall short, either because they depend on handcrafted features or because they lack biological plausibility. But DINO-trained ViTs appear to overcome these issues, suggesting that machines might be capable of developing human-like perception given the right training approach.
This work opens the door for more intuitive AI systems that align more closely with how humans see the world. Potential applications range from robotics and human-computer interaction to developmental tools for children and assistive technologies.