Efficiency at the Edge

EfficientViT outperforms the best image segmentation models, yet is less computationally intensive, allowing it to run on edge devices.

Nick Bild
AI & Machine Learning
Efficient semantic segmentation using a new technique (📷: H. Cai et al.)

Object detection algorithms are critical for applications like self-driving cars because they are the foundation of these vehicles' ability to navigate safely and autonomously. These algorithms are designed to identify and locate various objects in the vehicle's vicinity, such as pedestrians, other vehicles, cyclists, road signs, and obstacles. By accurately detecting and tracking these objects in real-time, self-driving cars can make informed decisions, predict potential hazards, and plan their trajectories accordingly. This capability is essential for ensuring the safety of passengers and other road users, as it allows the vehicle to react quickly and appropriately to dynamic and complex traffic scenarios.

The complex neural networks and other algorithms necessary for accurate object detection require a lot of processing power and memory. This is especially true when working with the sort of high-resolution images that are needed for applications where tracking accuracy is critical. But to be practical for many applications, these algorithms must be deployed to edge devices, which have limited computing resources and power and may struggle to meet these demands. Object detection pipelines certainly do exist for edge devices, but they tend to rely on lower-resolution images to minimize their computational complexity. Unfortunately, such a reduction in resolution is not acceptable for applications like self-driving cars.

Present state-of-the-art models perform semantic segmentation of images using a vision transformer, which splits an image into patches, or groups of pixels. Interactions between these groups of pixels are then learned, which serves as the basis for identifying and tracking objects. But this approach does not scale well: because every group of pixels interacts with every other group, the amount of computation required grows quadratically as the pixel count increases.
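The quadratic cost is easy to see in code. Below is a minimal NumPy sketch of standard softmax attention over N groups of pixels (illustrative only, not the authors' implementation): the N × N similarity matrix must be fully materialized, so doubling N quadruples the work.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard (quadratic) attention: an N x N similarity matrix is
    materialized, so cost grows as O(N^2 * d) in the token count N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, N) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d)

# Each of N tokens attends to all N others: the (N, N) matrix is the bottleneck.
N, d = 64, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (64, 8)
```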

This growing consumption of resources is unsustainable when building the low-latency, energy-efficient systems that will power the next generation of intelligent devices. Fortunately, a group led by researchers at MIT has been working on a solution to this problem. The result, which they call EfficientViT, is a far more efficient computer vision model for performing accurate on-device semantic segmentation. Not only is this model much less resource intensive, but it is also as accurate as, or more accurate than, the best models currently available.

EfficientViT builds on traditional vision transformer-based models by replacing the nonlinear similarity function that is normally used with a linear one. This allowed the team to adjust the order of operations in the algorithm, greatly reducing the computational complexity. As a result of this change, the required number of computations grows only linearly as the size of an image increases.
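The reordering trick works because a linear similarity factors into per-token features, and matrix multiplication is associative. The sketch below is a generic linear-attention formulation with a ReLU feature map (one common choice; the article does not specify the team's exact function): computing K.T @ V first yields a small d × d matrix whose size is independent of the token count.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention: with a decomposable similarity phi(Q) @ phi(K).T,
    associativity lets us compute phi(K).T @ V first -- a small (d, d)
    matrix -- so cost grows only linearly in the number of tokens N."""
    phi = lambda x: np.maximum(x, 0.0)  # ReLU feature map (illustrative choice)
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d): independent of N
    Z = Kf.sum(axis=0)                  # (d,) running normalizer
    return (Qf @ KV) / (Qf @ Z + eps)[:, None]

N, d = 64, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (64, 8)
```

Evaluating the same expression in the original order, (phi(Q) @ phi(K).T) @ V, gives an identical result but costs O(N² · d), which is exactly the expense the reordering avoids.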

Naturally, this change did not come without consequences. Linear functions cannot model the complex relationships that their nonlinear counterparts can, so the accuracy of semantic segmentation suffers. To compensate, a pair of additional modules was added to the pipeline. One helps the model capture local feature interactions, an ability hindered by the change in similarity function. The other focuses on multiscale learning, helping the model recognize both large and small objects. Importantly, these additions resulted in only a small increase in computational complexity.
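The article does not detail how these modules are built. Purely as an illustration of the multiscale idea, the hypothetical sketch below blends each token's features with local averages over windows of several sizes, so that both fine and coarse structure contribute; it is not the paper's design.

```python
import numpy as np

def multiscale_aggregate(x, scales=(1, 3, 5)):
    """Hypothetical multiscale aggregation (illustrative only): mix each
    token with moving averages over windows of several sizes, a 1-D
    stand-in for the 2-D local aggregation a vision model would use."""
    N, d = x.shape
    outs = []
    for w in scales:
        pad = w // 2
        xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        # moving average over a window of w neighboring tokens
        pooled = np.stack([xp[i:i + w].mean(axis=0) for i in range(N)])
        outs.append(pooled)
    return np.mean(outs, axis=0)  # blend all scales equally

x = np.random.default_rng(2).standard_normal((16, 4))
y = multiscale_aggregate(x)
print(y.shape)  # (16, 4)
```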

The researchers conducted experiments to assess the performance of EfficientViT using popular segmentation benchmark datasets, including Cityscapes and ADE20K. They found that their new method ran up to nine times faster than state-of-the-art semantic segmentation models when running on edge computing hardware. It was also observed that EfficientViT performed at least as well as existing methods in terms of accuracy.

Moving forward, the team plans to scale up their model, and also experiment with applying their methods to other computer vision tasks, like classification. This could enable the development of a new class of efficient devices for applications in self-driving cars and medicine that once required impractically large amounts of computational resources.
