Computer vision-based machine learning models have revolutionized the field of artificial intelligence by enabling computers to analyze and understand visual data. This has led to the development of new technologies in diverse areas such as robotics, autonomous driving, and medical diagnostics. Two of the most popular types of computer vision models are classification and object detection models.
Classification models are designed to assign a predefined label or category to an input image as a whole. Object detection models, on the other hand, not only classify objects but also localize them within an image. These models identify multiple objects of interest and provide bounding box coordinates around them. These techniques alone allow systems to interact with the world in many useful ways; however, there are still plenty of use cases that they cannot adequately address.
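The difference is easiest to see in the shape of each model's output. The functions below are hypothetical stand-ins, not real models, and exist only to illustrate that contrast:

```python
def classify(image):
    # Stand-in for a trained classifier: the entire image gets
    # exactly one label from a predefined set of categories.
    return "kitchen"

def detect(image):
    # Stand-in for a trained object detector: each entry pairs a
    # label with a bounding box (x_min, y_min, x_max, y_max) in
    # pixel coordinates, so multiple objects can be localized.
    return [
        ("butter", (120, 80, 180, 110)),
        ("knife", (40, 200, 220, 230)),
    ]
```

A classifier answers "what is this image?", while a detector answers "what is in this image, and where?"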
Consider, for example, a cooking robot that needs to find butter to complete a recipe. An object detection model could locate a stick of butter easily enough, but what if only a quarter of the stick was left? Or what if the butter was instead in a tub? This would present an object detection model with a lot of challenges, because it would need to be trained on a diverse, labeled dataset containing each possible form butter could be found in.
This is a big enough problem as it is, but when expanding the scope of a robot to a wider range of tasks, the amount of training data and annotation that would be needed quickly becomes impractical to collect. A material selection model offers the promise of overcoming these challenges: instead of detecting the shape of a stick of butter, it recognizes what material an object is made of. Such a model could, in theory, detect butter whether it is in a stick, a tub, or spread on a piece of toast, all without requiring specific examples of each of these forms.
Unfortunately, because a material’s appearance can vary drastically due to lighting, the shape of the object, and other factors, material selection models have yet to live up to their promise. MIT researchers have recently made some progress on this front, however, with a processing pipeline that they call Materialistic. After a user selects a pixel or small region of an image, this tool can accurately locate everything else in the image made of that same material.
The team wanted Materialistic to be capable of matching arbitrary materials, not just a small predefined set. Since no sufficiently large dataset exists that maps images to fine-grained labels of where each material appears, the team created a synthetic dataset in a simulated environment. A set of 50,000 images was generated containing 16,000 materials that were randomly applied to objects.
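At a high level, that random material-to-object assignment could be sketched as follows. The rendering step, which the researchers perform in a simulator, is omitted here, and all names and structures are assumptions for illustration only:

```python
import random

def generate_synthetic_dataset(num_images, materials, objects_per_scene=5):
    """Sketch of random material-to-object assignment.

    Each scene records which material was applied to which object;
    once the scene is rendered, that record doubles as a precise
    label of where each material appears in the image.
    """
    dataset = []
    for i in range(num_images):
        scene = {
            f"object_{j}": random.choice(materials)
            for j in range(objects_per_scene)
        }
        dataset.append({"image_id": i, "materials": scene})
    return dataset
```

The appeal of synthetic data here is that the labels come for free: the simulator knows exactly which material covers every pixel, which would be prohibitively expensive to annotate by hand.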
After training the model, it was tested to ensure that the work translated into accurate results in the real world. Unfortunately, it did not. But further examination revealed that this was due to a distribution shift: the simulated data and real-world data were simply too different. It was found, however, that by adding a pretrained DINO computer vision model as the first step in the pipeline, to give the system an understanding of real-world scenes, the accuracy was greatly improved.
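As a rough illustration of that two-stage design, the sketch below fakes the DINO backbone with a fixed random projection; the real backbone is a pretrained vision transformer, and the function names, shapes, and normalization step here are assumptions rather than the researchers' actual code:

```python
import numpy as np

def pretrained_backbone(image):
    # Stand-in for a pretrained DINO model: it maps an H x W x 3
    # image to a grid of feature vectors encoding real-world
    # appearance. Faked here with a fixed random projection.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, 8))
    return image @ proj  # shape (H, W, 8)

def material_head(features):
    # Stand-in for the trained material-matching network. It now
    # consumes backbone features instead of raw pixels, which is
    # what narrows the gap between simulated and real inputs.
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

def pipeline(image):
    # Pretrained backbone first, material-specific head second.
    return material_head(pretrained_backbone(image))
```

The design intuition is that the pretrained backbone has already seen vast amounts of real imagery, so the downstream network trained on synthetic data no longer has to bridge the simulation-to-reality gap on its own.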
Using this final model, when a user selects a material in an image, Materialistic processes every other pixel and assigns it a score representing the certainty level that the material is the same as the target. When compared with existing material selection models, Materialistic was found to outperform them, with average accuracy rates of about 92%.
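One common way to turn per-pixel embeddings into such scores is cosine similarity against the embedding of the selected pixel. The paper's exact scoring function may differ, so the sketch below is only an illustration of the general idea:

```python
import numpy as np

def similarity_scores(features, query_yx):
    """Score every pixel against the user-selected pixel.

    `features` is an (H, W, D) array of per-pixel embeddings, here
    assumed to come from the trained model. The score is the cosine
    similarity with the embedding at `query_yx`, so pixels showing
    the same material should score near 1.0.
    """
    target = features[query_yx]                       # (D,)
    norms = np.linalg.norm(features, axis=-1)         # (H, W)
    dots = features @ target                          # (H, W)
    # Small epsilon guards against division by zero.
    return dots / (norms * np.linalg.norm(target) + 1e-8)
```

Thresholding these scores would then yield a mask of everything in the image judged to be made of the same material as the user's selection.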
It has been noted, however, that the method struggles with very fine features, or thin strips of a material. This could be a result of the training data used, which did not contain many thin geometric structures. The researchers are working to address this issue for a future release. But even as it stands today, Materialistic offers the promise of giving computer vision systems a better high-level understanding of the world.