If you follow the latest advances in machine learning and artificial intelligence, you have likely concluded that intelligent systems are not as intelligent as the nomenclature implies. A current machine learning model may show excellent performance when classifying videos of dogs being given baths, for example. But the same model would not understand the abstract concept of “cleaning” and recognize it in other videos, such as a video of someone mopping a floor.
Toward the goal of building machines that can reason about everyday actions, a group at MIT’s Computer Science and Artificial Intelligence Laboratory has proposed a new machine learning approach, inspired by human learning, that can identify common patterns among a diverse set of events.
Despite the ease with which humans organize the world into abstract categories, this has been a very challenging problem computationally. Using a hybrid language-vision model, the researchers built a system that performs as well as, or better than, humans at two visual tasks: picking the video that conceptually best completes a set, and picking the video that does not fit with the others.
The approach has two main steps. First, a ResNet base network extracts features from video clips. These features are then fed into a Set Abstraction Module, which predicts the common category of all the videos in a set. This module incorporates information from word embeddings that the team computed from a large body of text; the embeddings capture context and the relationships between words. Study co-author Mathew Monfort summarized the idea behind the approach: “Words like ‘running,’ ‘lifting,’ and ‘boxing’ share some common characteristics that make them more closely related to the concept ‘exercising,’ for example, than ‘driving.’”
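To make the two-step pipeline concrete, here is a minimal sketch of the idea, not the authors' actual model. It assumes step one (the ResNet) has already turned each clip into a feature vector, and it stands in for the Set Abstraction Module with simple mean pooling followed by cosine similarity against word-embedding vectors for candidate concepts. All names, vectors, and the pooling choice are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_set_concept(clip_features, concept_embeddings):
    """Pool a set of clip feature vectors and return the name of the
    nearest concept embedding (a stand-in for the set-abstraction step)."""
    pooled = np.mean(clip_features, axis=0)  # crude set abstraction: mean pooling
    return max(concept_embeddings,
               key=lambda name: cosine(pooled, concept_embeddings[name]))

# Toy demo with hypothetical 3-d "embeddings": two concept vectors and a
# set of clips whose features cluster near the "exercising" direction.
concepts = {
    "exercising": np.array([1.0, 0.0, 0.0]),
    "driving":    np.array([0.0, 1.0, 0.0]),
}
clips = np.stack([
    np.array([0.9, 0.1, 0.0]),   # e.g. a "running" clip
    np.array([0.8, 0.0, 0.2]),   # e.g. a "lifting" clip
])
print(predict_set_concept(clips, concepts))  # -> exercising
```

In the real system the pooling and matching are learned jointly, but the sketch shows why shared structure in the embedding space lets related actions map to one abstract category.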
The hope is that models that can learn abstract concepts may eventually be trained on less data. Data collection is currently a major hurdle in the field.