A team from the Massachusetts Institute of Technology (MIT) and Google's artificial intelligence (AI) arm has found a way to use visual transfer learning to help robots grasp and manipulate objects more accurately.
"We investigate whether existing pre-trained deep learning visual feature representations can improve the efficiency of learning robotic manipulation tasks, like grasping objects," write Google's Yen-Chen Lin and Andy Zeng of the research. "By studying how we can intelligently transfer neural network weights between vision models and affordance-based manipulation models, we can evaluate how different visual feature representations benefit the exploration process and enable robots to quickly acquire manipulation skills using different grippers.
"We initialized our affordance-based manipulation models with backbones based on the ResNet-50 architecture and pre-trained on different vision tasks, including a classification model from ImageNet and a segmentation model from COCO. With different initialisations, the robot was then tasked with learning to grasp a diverse set of objects through trial and error. Initially, we did not see any significant gains in performance compared with training from scratch – grasping success rates on training objects were only able to rise to 77% after 1,000 trial and error grasp attempts, outperforming training from scratch by 2%.
"However," the pair continue, "upon transferring network weights from both the backbone and the head of the pre-trained COCO vision model, we saw a substantial improvement in training speed – grasping success rates reached 73% in just 500 trial and error grasp attempts, and jumped to 86% by 1,000 attempts. In addition, we tested our model on new objects unseen during training and found that models with the pre-trained backbone from COCO generalize better. The grasping success rates reach 83% with pre-trained backbone alone and further improve to 90% with both pre-trained backbone and head, outperforming the 46% reached by a model trained from scratch."
The team's results speak for themselves: given 50 attempts each, a robot using a randomly initialized model managed only 9 successes; one using the ImageNet model managed 11; one using only the backbone from COCO managed 15; while the one using both backbone and head from COCO achieved 23 successes.
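The four starting points compared above differ only in which pre-trained weights are copied into the manipulation model before training begins. The sketch below is a rough illustration of that idea, not the team's code: the parameter names (`backbone.conv1`, `head.out`, `classifier.fc`), the toy weight values, and the `transfer_weights` helper are all hypothetical, and plain Python dictionaries stand in for real network weights.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a model's weights: parameter name -> list of values.
Weights = Dict[str, List[float]]


def transfer_weights(pretrained: Weights, target: Weights,
                     prefixes: Tuple[str, ...]) -> Weights:
    """Return a copy of `target` in which every parameter whose name starts
    with one of `prefixes`, and which also exists in `target`, is replaced
    by the value from `pretrained`."""
    merged = {name: list(value) for name, value in target.items()}
    for name, value in pretrained.items():
        # str.startswith accepts a tuple; an empty tuple never matches.
        if name.startswith(prefixes) and name in merged:
            merged[name] = list(value)
    return merged


# Hypothetical weights, invented purely for illustration.
coco = {"backbone.conv1": [0.5, 0.1], "head.out": [0.9]}
imagenet = {"backbone.conv1": [0.3, 0.7], "classifier.fc": [0.2]}
scratch = {"backbone.conv1": [0.0, 0.0], "head.out": [0.0]}

configs = {
    "random initialization": transfer_weights(coco, scratch, ()),
    "ImageNet backbone": transfer_weights(imagenet, scratch, ("backbone.",)),
    "COCO backbone": transfer_weights(coco, scratch, ("backbone.",)),
    "COCO backbone and head": transfer_weights(
        coco, scratch, ("backbone.", "head.")),
}

for name, weights in configs.items():
    print(name, weights)
```

In this toy setup, only the last configuration starts with both its backbone and its head taken from the segmentation model; the ImageNet classifier layer has no counterpart in the manipulation model, so nothing beyond the backbone carries over.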
"These results suggest that reusing network weights from vision tasks that require object localisation (e.g., instance segmentation, like COCO) has the potential to significantly improve the exploration process when learning manipulation tasks," the researchers write. "Pre-trained weights from these tasks encourage the robot to sample actions on things that look more like objects, thereby quickly generating a more balanced dataset from which the system can learn the differences between good and bad grasps. In contrast, pre-trained weights from vision tasks that potentially discard objects’ spatial information (e.g., image classification, like ImageNet) can only improve the performance slightly compared to random initialization."
More information on the team's work can be found on Lin's website, along with a copy of the paper and a link to a repository where the source code is to be published following the work's presentation at the 2020 International Conference on Robotics and Automation (ICRA 2020) in late May. Google's AI team has also published its own write-up.