LookHere Uses Simple Gestures to Let Anyone Build High-Quality Object Models for Machine Learning
Designed to let non-experts guide a machine learning system to create quality object models, LookHere does exactly what it promises.
A pair of researchers from the Interactive Intelligent Systems Lab of the University of Tokyo have come up with a way to allow anyone to train a machine learning system to recognize objects — by simply waving them at a camera.
"In a typical object training scenario, people can hold an object up to a camera and move it around so a computer can analyze it from all angles to build up a model," explains first author Zhongyi Zhou. "However, machines lack our evolved ability to isolate objects from their environments, so the models they make can inadvertently include unnecessary information from the backgrounds of the training images.
"This often means users must spend time refining the generated models, which can be a rather technical and time-consuming task. We thought there must be a better way of doing this that’s better for both users and computers, and with our new system, LookHere, I believe we have found it."
The LookHere system aims to resolve two key problems with what the pair call "machine teaching," or the training of a machine learning model for a given task: the inefficiencies of teaching, which is often time-consuming and can require specialist knowledge; and the inefficiencies of learning, where "noisy" data — such as object models that include extraneous environmental data by mistake — can reduce the performance of the resulting system.
To prove the concept, the pair built LookHere around a custom dataset of hand-gesture videos dubbed HuTics. Like previous approaches, the system builds its object models from images captured by a camera as the user waves an object around, but LookHere goes a step further by incorporating the user's hand gestures into the processing stage, allowing objects to be emphasized over background imagery in the same way a person might show another person what's important. In other words: point to an object, and the machine will understand where the focus of its attention should be.
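The gesture-guided emphasis idea can be loosely illustrated with a toy sketch: take a raw per-pixel "objectness" map and reweight it by proximity to the detected hand region, so that whatever the user is holding up or pointing at dominates the result. The function below is a hypothetical illustration of the concept, not the project's actual algorithm; the `emphasize_object` name and the Gaussian falloff from the hand centroid are assumptions.

```python
import numpy as np

def emphasize_object(saliency: np.ndarray, hand_mask: np.ndarray,
                     sigma: float = 20.0) -> np.ndarray:
    """Reweight a saliency map so regions near the user's hands dominate.

    saliency:  HxW array of raw per-pixel objectness scores
    hand_mask: HxW binary array marking detected hand pixels
    sigma:     falloff radius, in pixels (assumed parameter)
    """
    h, w = saliency.shape
    ys, xs = np.nonzero(hand_mask)
    if len(ys) == 0:                  # no hands visible: keep raw saliency
        return saliency
    cy, cx = ys.mean(), xs.mean()     # centroid of the hand region
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (yy - cy) ** 2 + (xx - cx) ** 2   # squared distance to centroid
    weight = np.exp(-dist2 / (2.0 * sigma ** 2))
    return saliency * weight          # pixels far from the hand fade out
```

A background pixel far from the hand ends up with a near-zero score even if the raw detector found it salient, which is the behavior the researchers describe: the hand tells the model where to look.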
"The idea is quite straightforward, but the implementation was very challenging," says Zhou. "Everyone is different and there is no standard set of hand gestures. So, we first collected 2,040 example videos of 170 people presenting objects to the camera into HuTics. These assets were annotated to mark what was part of the object and what parts of the image were just the person’s hands.
"LookHere was trained with HuTics, and when compared to other object recognition approaches, can better determine what parts of an incoming image should be used to build its models. To make sure it’s as accessible as possible, users can use their smartphones to work with LookHere and the actual processing is done on remote servers."
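The smartphone-plus-remote-server split described above can be sketched minimally: the phone posts a captured frame over HTTP, and the server, where the heavy segmentation model would actually run, sends back a result. Everything here (the `/segment` endpoint, the JSON reply shape, the dummy frame) is an assumption for illustration, using only Python's standard library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Stub "remote server": accepts a posted camera frame and replies with a
# dummy result. In a LookHere-style deployment the segmentation model
# would run here; the endpoint and reply shape are assumptions.
class SegmentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        frame_bytes = self.rfile.read(length)        # raw frame from phone
        reply = json.dumps({"mask_pixels": len(frame_bytes)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):                    # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), SegmentHandler)
port = server.server_address[1]
threading.Thread(target=server.handle_request).start()  # serve one request

# "Smartphone" side: send one captured frame to the server.
fake_frame = bytes(64)                               # stand-in for JPEG data
req = Request(f"http://127.0.0.1:{port}/segment", data=fake_frame)
with urlopen(req) as resp:
    result = json.loads(resp.read())
server.server_close()
print(result["mask_pixels"])                         # → 64
```

Keeping the client this thin is what makes the approach accessible: the phone only needs a camera and a network connection, while the compute-hungry model lives server-side.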
The result, the researchers claim, is a system that can build models up to 14 times faster than rival approaches, and one that could, in theory at least, be extended beyond visual data to improve the quality of other forms of data, such as sound, and in turn the performance of the resulting machine learning model.
The team's work has been published under open-access terms in the Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (UIST '22). The system's source code has been published to GitHub under the Creative Commons Zero v1.0 Universal (CC0) public-domain dedication.