MvACon Gives Transformers Better 3D Understanding From 2D Cameras — with Minimal Overhead

Applicable to existing vision transformers, MvACon can deliver improved 3D object detection with barely any added computational cost.

Researchers from North Carolina State University and the University of Central Florida, working with Ant Group and the OPPO US Research Center, have developed a new approach to 3D object detection from two-dimensional camera views: Multi-View Attentive Contextualization (MvACon).

"Most autonomous vehicles use powerful AI [Artificial Intelligence] programs called vision transformers to take 2D images from multiple cameras and create a representation of the 3D space around the vehicle," explains corresponding author Tianfu Wu, associate professor of electrical and computer engineering at North Carolina State University, of the team's work. "However, while each of these AI programs takes a different approach, there is still substantial room for improvement."

"Our technique, called Multi-View Attentive Contextualization (MvACon), is a plug-and-play supplement that can be used in conjunction with these existing vision transformer AIs to improve their ability to map 3D spaces," Wu continues. "The vision transformers aren't getting any additional data from their cameras, they’re just able to make better use of the data."

The project builds on Patch-to-Cluster Attention (PaCa), which was designed to let transformer models better identify objects within an image — applying the same approach to mapping a 3D space from multiple 2D camera views. In testing, the team paired MvACon with three popular vision transformers and evaluated their performance on a six-camera setup, with impressive results.
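The core idea behind PaCa is to have each image patch attend to a small number of learned cluster tokens rather than to every other patch, cutting attention cost from O(N²) to O(N·M) for N patches and M clusters. The following is a minimal NumPy sketch of that pattern, not the authors' implementation — all weight matrices, dimensions, and function names here are invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_to_cluster_attention(patches, Wc, Wq, Wk, Wv):
    """Illustrative patch-to-cluster attention: N patches attend to M cluster tokens."""
    # Soft-assign patches to clusters; normalise over patches so each
    # cluster token is a weighted average of the patches. Shape: (N, M)
    assign = softmax(patches @ Wc, axis=0)
    clusters = assign.T @ patches            # (M, d) cluster tokens
    # Standard scaled dot-product attention, but keys and values come
    # from the M clusters instead of all N patches.
    q = patches @ Wq                         # (N, d)
    k = clusters @ Wk                        # (M, d)
    v = clusters @ Wv                        # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    return attn @ v                          # (N, d): O(N*M), not O(N^2)

# Toy sizes chosen arbitrarily for the demo
rng = np.random.default_rng(0)
N, M, d = 64, 8, 16
patches = rng.standard_normal((N, d))
W = lambda a, b: rng.standard_normal((a, b)) * 0.1
out = patch_to_cluster_attention(patches, W(d, M), W(d, d), W(d, d), W(d, d))
print(out.shape)
```

With M fixed and small, the attention map stays compact even as image resolution (and so N) grows — the property MvACon leverages when contextualizing features across multiple camera views.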

"Performance was particularly improved when it came to locating objects, as well as the speed and orientation of those objects," Wu says of the team's findings. "And the increase in computational demand of adding MvACon to the vision transformers was almost negligible. Our next steps include testing MvACon against additional benchmark datasets, as well as testing it against actual video input from autonomous vehicles. If MvACon continues to outperform the existing vision transformers, we're optimistic that it will be adopted for widespread use."

The team's work is to be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition next week; a PDF is available to download from the project website. A GitHub repository has been prepared for the project's code, but had not yet been populated at the time of writing.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin.