The Eye in Edge AI

Edge AI 2.0 has arrived with NVIDIA's VILA family of vision language models that bring advanced visual reasoning to low-power edge devices.

Nick Bild
12 days ago • Machine Learning & AI
VILA brings high-performance vision language models to the edge (📷: NVIDIA)

Machine learning at the edge has enabled the development of many new applications for wearable, portable, and other resource-constrained hardware platforms in recent years. These applications have brought the power of artificial intelligence (AI) directly to the source of data collection. This has been crucial to the growth of the technology, as it eliminates the privacy-related concerns that arise when sending sensitive personal information to a cloud computing environment. Edge AI has also reduced the latency associated with that approach, making real-time applications possible.

But these advantages come with some limitations. Resources like processing power and memory are severely limited in edge hardware, so machine learning models must be downsized accordingly. Aside from impacting their accuracy, these measures have also traditionally limited the models to a narrow range of use cases. If multiple tasks needed to be handled, that would mean training different models, each with its own dataset. Needless to say, this process is time-consuming and severely restricts the ability of a given model to adapt to new situations.

Technology is progressing, however, and we now find ourselves on the cusp of the Edge AI 2.0 era. Generalized models, typified by large language models (LLMs), which contain a broad body of knowledge about the world and can be put to many purposes, are now finding their way to the edge. The latest entrant into this new generation of models is a family of vision language models developed by researchers at NVIDIA and MIT called VILA. Ranging from 3 to 40 billion parameters in size, these models can be deployed to a wide range of hardware platforms. Once deployed, VILA is capable of performing complex visual reasoning tasks.

VILA was made possible by first pretraining a traditional LLM on textual data to give it a broad base of knowledge about the world. That model was then extended by tokenizing image data and pairing it with textual descriptions of those images, which were used to further train VILA. By leveraging a high-quality data mixture in this way, the model was able to acquire advanced visual reasoning skills.
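To make that idea more concrete, the sketch below shows the general shape of such a pipeline: a vision encoder turns an image into tokens, a projector maps them into the LLM's embedding space, and the language model reasons over the interleaved image and text tokens. The classes, layer sizes, and names here are illustrative placeholders, not NVIDIA's actual VILA architecture.

```python
# Minimal sketch of a vision language model forward pass, assuming a
# hypothetical encoder/projector/LLM split (not NVIDIA's real code).
import torch
import torch.nn as nn

class TinyVisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT) that turns
        # an image into a sequence of patch embeddings.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Projector that maps visual features into the LLM's token space,
        # so image "tokens" can sit alongside text tokens.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for a pretrained text-only LLM backbone.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_token_ids):
        # "Tokenize" the image: encode patches, then project into LLM space.
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        # Embed the paired text description.
        text_tokens = self.text_embed(text_token_ids)
        # Concatenate image and text tokens and let the LLM reason over
        # the joint sequence.
        joint = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(joint))

# Example: one image split into 16 patches plus an 8-token caption.
model = TinyVisionLanguageModel()
logits = model(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```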

Of course the goal was not just to build a vision language model, but to build one that is suitable for deployment on constrained computing platforms. So in addition to limiting the size of the models in the VILA family, the team also employed 4-bit activation-aware weight quantization (AWQ), which reduces model sizes and increases inference speeds, but crucially has been shown to do so with a negligible drop in accuracy.
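The core intuition behind activation-aware quantization is that not all weights matter equally: the channels that see the largest activations should keep more precision. The snippet below is a simplified reconstruction of that idea, not the team's actual implementation, and the function and variable names are placeholders of my own.

```python
# Illustrative sketch of 4-bit activation-aware weight quantization,
# loosely following the AWQ idea (rescale salient weight channels using
# activation statistics before rounding). Not NVIDIA's or MIT's code.
import numpy as np

def awq_style_quantize(weights, activations, n_bits=4):
    """Quantize a (out_features, in_features) weight matrix to n_bits.

    weights:     FP32 weight matrix of a linear layer.
    activations: sample inputs (batch, in_features) used to estimate
                 which input channels matter most.
    """
    # Per-input-channel activation magnitude: salient channels get a
    # larger scale so their weights lose less precision when rounded.
    act_scale = np.abs(activations).mean(axis=0)
    act_scale = np.clip(act_scale / act_scale.mean(), 1e-4, None) ** 0.5

    # Fold the scale into the weights (W * s); the layer output stays
    # mathematically unchanged because (x / s) @ (W * s).T == x @ W.T.
    scaled_w = weights * act_scale[None, :]

    # Plain per-output-channel symmetric quantization to 4 bits.
    qmax = 2 ** (n_bits - 1) - 1          # 7 for signed 4-bit
    w_scale = np.abs(scaled_w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(scaled_w / w_scale), -qmax - 1, qmax).astype(np.int8)

    # Dequantized weights (what inference effectively sees).
    deq = q * w_scale / act_scale[None, :]
    return q, w_scale, act_scale, deq

# Example: quantize a random 128x256 layer with random calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256)).astype(np.float32)
X = rng.normal(size=(32, 256)).astype(np.float32)
q, w_scale, act_scale, W_deq = awq_style_quantize(W, X)
print("max abs reconstruction error:", np.abs(W - W_deq).max())
```

In a real deployment the activation scaling is folded into the preceding layer rather than applied at runtime, which is part of why the accuracy cost can stay so small.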

Speaking of accuracy, a battery of benchmarks showed that the 3-billion-parameter model lost no appreciable accuracy after being quantized, as compared to a 16-bit floating-point version of the model. Moreover, it was demonstrated that VILA is very competent at both image and video question-answering tasks.

The TinyChat inference framework has recently expanded its support to include VILA, so deployment should be a piece of cake. NVIDIA suggests the Jetson Orin line of edge computing devices as a target platform, as it offers options from entry-level to high-performance for a wide range of use cases.

VILA is completely open source, from the trained models to the code and data used to train them. There are also some tutorials available to help developers get up to speed with the technology quickly.
