MonoCon Aims to Give 2D Computer Vision Systems a Better Understanding of Our 3D World

Offering, its creators claim, improved accuracy and increased throughput, MonoCon is now being scaled up — and adapted for robot arms.

Scientists from North Carolina State University and Wuhan University have come up with a technique designed to give artificial intelligence systems a better understanding of 3D space — even when they only have access to 2D imagery.

"We live in a 3D world, but when you take a picture, it records that world in a 2D image," explains Tianfu Wu, corresponding author and assistant professor at NCSU. "AI programs receive visual input from cameras. So if we want AI to interact with the world, we need to ensure that it is able to interpret what 2D images can tell it about 3D space. In this research, we are focused on one part of that challenge: How we can get AI to accurately recognize 3D objects — such as people or cars — in 2D images, and place those objects in space."

The team's approach, dubbed MonoCon, isn't entirely novel: Like its predecessors, it relies on annotating objects in a 2D image with 3D bounding boxes, providing coordinates for each box's eight corners. With repeated training, the result is a system that can infer 3D shape and position from 2D imagery alone.

"What sets our work apart is how we train the AI, which builds on previous training techniques," Wu claims. "Like the previous efforts, we place objects in 3D bounding boxes while training the AI. However, in addition to asking the AI to predict the camera-to-object distance and the dimensions of the bounding boxes, we also ask the AI to predict the locations of each of the box’s eight points and its distance from the center of the bounding box in two dimensions. We call this 'auxiliary context,' and we found that it helps the AI more accurately identify and predict 3D objects based on 2D images."
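The "auxiliary context" Wu describes amounts to extra regression targets: alongside distance and box dimensions, the network also predicts where the box's eight corners land in the image and how far each sits from the box's projected centre. A minimal sketch of how those targets can be derived from a labelled 3D box is below; the function names, the box parameterisation, and the camera intrinsics are illustrative assumptions, not the team's actual code.

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners (3 x 8) of a 3D box in camera coordinates.

    center: (x, y, z) of the box centre; dims: (h, w, l); yaw: rotation
    about the camera's vertical (y) axis. Hypothetical parameterisation.
    """
    h, w, l = dims
    # Corner offsets relative to the box centre, before rotation.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # rotation about y
    return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

def project(points, K):
    """Pinhole-project 3 x N camera-frame points to N x 2 pixel coords."""
    uvw = K @ points
    return (uvw[:2] / uvw[2]).T

# Hypothetical intrinsics: 700 px focal length, principal point (640, 360).
K = np.array([[700.0,   0.0, 640.0],
              [  0.0, 700.0, 360.0],
              [  0.0,   0.0,   1.0]])

center = (2.0, 1.0, 20.0)                      # a car 20 m in front of the camera
corners = box3d_corners(center, dims=(1.5, 1.6, 3.9), yaw=0.3)
corners_2d = project(corners, K)               # the eight projected keypoints
center_2d = project(np.asarray(center).reshape(3, 1), K)[0]
offsets_2d = corners_2d - center_2d            # per-corner 2D offsets: the extra
                                               # "auxiliary context" training targets
```

During training, supervising these per-corner offsets gives the network extra geometric signal; at inference time the auxiliary heads can be dropped, which is consistent with the throughput advantage reported below.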

Grounded in the Cramér–Wold theorem, named for Harald Cramér and Herman Ole Andreas Wold, the system performs convincingly: Tested on the KITTI benchmark, MonoCon outperformed all other tested methods at extrapolating 3D data for automobiles found in 2D images. That lead didn't extend to other road users, however, with MonoCon offering only "comparable" accuracy when asked to identify pedestrians and cyclists, albeit with a higher inference throughput than the competition.

"Moving forward, we are scaling this up and working with larger datasets to evaluate and fine-tune MonoCon for use in autonomous driving," says Wu. "We also want to explore applications in manufacturing, to see if we can improve the performance of tasks such as the use of robotic arms."

The team's work has been published under open-access terms on the arXiv.org preprint server.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.