This Pose Is a Problem
A new 6D object pose estimation system uses sparse color-coding and a multi-stage pipeline for high accuracy on edge computing devices.
Everything from grasping and manipulation tasks in robotics to scene understanding in virtual reality and obstacle detection in self-driving vehicles relies on 6D object pose estimation. Naturally, that makes this a very hot area of research and development at present. The technology leverages 2D images and cutting-edge algorithms to determine the 3D position and orientation of objects of interest. That information, in turn, gives computer systems a detailed understanding of their surroundings, a prerequisite for interacting in any meaningful way with a real world where conditions are constantly changing.
This is a very challenging problem to solve, however, so there is much work yet to be done. As things presently stand, traditional 6D object pose estimation systems tend to struggle under difficult lighting conditions or when objects are partially occluded. Deep learning-based approaches have significantly mitigated these issues, but they have problems of their own: they generally require a lot of computational horsepower, which drives up cost, equipment size, and energy consumption.
A trio of engineers at the University of Washington has built on the deep learning-based approaches that have emerged in recent years, adding a few tricks to address their limitations. Called Sparse Color-Code Net (SCCN), the team's 6D pose estimation system is a multi-stage pipeline. It starts by processing the input image with Sobel filters, which highlight the edges and contours of objects, capturing essential surface detail while ignoring less important regions. The filtered image, along with the original, is then fed into a neural network called a UNet, which segments the image, identifying and isolating the target objects and their bounding boxes (the smallest rectangles that can contain them).
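To make that first preprocessing step concrete, here is a minimal sketch of Sobel-based edge extraction using OpenCV. The filename, kernel size, and channel stacking are illustrative assumptions; the paper's actual configuration may differ.

```python
import cv2
import numpy as np

def sobel_edge_map(image_bgr: np.ndarray) -> np.ndarray:
    """Compute a Sobel gradient-magnitude edge map (illustrative sketch)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Horizontal and vertical image gradients
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    # The gradient magnitude highlights object edges and contours
    mag = cv2.magnitude(gx, gy)
    # Rescale to 8-bit so the map can be stacked with the original image
    return cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

image = cv2.imread("scene.png")         # hypothetical input frame
edges = sobel_edge_map(image)
# The edge map rides along with the original color channels as
# input to the segmentation UNet
unet_input = np.dstack([image, edges])  # shape: H x W x 4
```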
In the next stage, the system takes the segmented and cropped object patches and runs them through another UNet. This network assigns specific colors to different parts of the objects, which helps in establishing correspondences between 2D image points and their 3D counterparts. Additionally, it predicts a symmetry mask to handle objects that look the same from different angles.
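The exact encoding isn't spelled out above, but one common way to realize this kind of color code, assumed here purely for illustration, is to normalize each 3D model coordinate into the RGB range, so that a pixel's predicted color directly names the surface point it depicts:

```python
import numpy as np

def color_code(vertices: np.ndarray) -> np.ndarray:
    """Assign each 3D model vertex (N x 3 array) a unique RGB color.

    Assumed scheme for illustration: normalizing model coordinates
    into [0, 255] means a predicted pixel color can later be decoded
    back into the 3D surface point it corresponds to.
    """
    mins = vertices.min(axis=0)
    spans = vertices.max(axis=0) - mins
    return np.round((vertices - mins) / spans * 255).astype(np.uint8)

def decode_color(rgb: np.ndarray, mins: np.ndarray, spans: np.ndarray) -> np.ndarray:
    """Invert the encoding: recover a 3D model point from an RGB color."""
    return rgb.astype(np.float32) / 255.0 * spans + mins
```

Under a scheme like this, the color-prediction UNet is effectively regressing dense 2D-to-3D correspondences directly in image space.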
The system then selects the relevant color-coded pixels based on the previously extracted contours and transforms them into a 3D point cloud, a collection of points representing the object's surface in 3D space. Finally, the system uses the Perspective-n-Point (PnP) algorithm to calculate the object's 6D pose, pinning down its exact position and orientation.
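That last solve is a standard computer vision operation. As a sketch, the pose could be recovered with OpenCV's RANSAC-based PnP solver; the article doesn't specify which PnP variant SCCN uses, and the camera intrinsics are assumed to be known:

```python
import cv2
import numpy as np

def recover_pose(points_3d: np.ndarray, points_2d: np.ndarray,
                 camera_matrix: np.ndarray):
    """Estimate a 6D pose from 2D-3D correspondences via PnP.

    points_3d: N x 3 model points decoded from the color code
    points_2d: N x 2 pixel locations of the selected color-coded pixels
    camera_matrix: 3 x 3 intrinsics (assumed known; distortion ignored)
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        camera_matrix,
        distCoeffs=None,
    )
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle vector -> 3x3 matrix
    return rotation, tvec              # together, the object's 6D pose
```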
This approach has a number of advantages. By focusing only on the important parts of the image (sparse regions), the algorithm can run fast on edge computing platforms while maintaining a high level of accuracy.
SCCN was put to the test on an NVIDIA Jetson AGX Xavier edge computing device. When evaluated on the LINEMOD dataset, SCCN proved capable of processing 19 frames per second. Even on the more challenging Occlusion LINEMOD dataset, where objects are often partially hidden from view, SCCN ran at 6 frames per second. Crucially, these results were accompanied by high estimation accuracy.
The balance of precision and speed exhibited by this new technique could make it suitable for all sorts of interesting applications in the near future.