NVIDIA’s ACE in the Hole
A suite of multimodal small language models just released by NVIDIA brings digital humans to edge devices, enhancing speed and privacy.
Thanks to the current boom in artificial intelligence (AI), triggered by advances in generative AI tools, NVIDIA is one of the hottest companies in the world. While the company is best known for the ultra-powerful GPUs that fill data centers and churn through trillions of calculations to make applications like large language models and text-to-image generators tick, it seems to recognize that the future of machine learning is on-device, not in the cloud. The present era of remote AI inference is something of a stopgap: the cloud provides the needed horsepower, but the latency and privacy concerns that come with this architecture are quite limiting.
In a step toward a more portable and local future for AI, NVIDIA has just announced the release of some very capable small language models (SLMs). These models have characteristics normally only seen in resource-intensive algorithms that run on large clusters of machines with GPUs, yet they can run on far less powerful (and less costly) edge computing devices, like the NVIDIA Jetson line of single-board computers, or on a single, consumer-grade GPU. This is good news for the development of tools that utilize virtual agents, assistants, and avatars — especially where cost, speed, and privacy are key considerations.
The new SLMs are a part of NVIDIA ACE, a suite of technologies meant to facilitate the creation of digital humans. Of special interest is the Nemovision-4B-Instruct model — the first multimodal SLM from NVIDIA. This model enables digital humans to interpret visual imagery from the real world or a desktop computer and generate contextually accurate responses to queries. Built using NVIDIA VILA and the NeMo framework, the model incorporates techniques like distillation, pruning, and quantization to remain efficient while being compatible with a wide range of NVIDIA GPUs.
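To make the interaction pattern concrete, here is a minimal sketch of how a client might package an image and a question for a vision-capable model. It assumes an OpenAI-style chat-completions payload, a format many NIM microservices accept; the model identifier used below is an illustrative guess, not a confirmed endpoint name.

```python
import base64
import json

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "nvidia/nemovision-4b-instruct") -> dict:
    """Pack an image plus a text question into an OpenAI-style
    chat-completions payload. The model string is hypothetical."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 256,
    }

# In practice this dict would be POSTed to the microservice with an HTTP client.
payload = build_vision_request(b"\x89PNG...", "What is on the screen?")
print(json.dumps(payload)[:80])
```

The payload structure (text part plus base64 image part in one user message) is what lets a single query combine on-screen or camera imagery with a natural-language question.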
In addition, NVIDIA has developed large-context SLMs, such as the Mistral-NeMo-Minitron-128k-Instruct models, available in configurations with 2B, 4B, and 8B parameters. These models can process large volumes of data in a single pass, which simplifies complex tasks and enhances accuracy. Developers can balance speed, memory usage, and precision by optimizing the models for their specific requirements and hardware platform.
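One of the techniques behind that speed/memory/precision tradeoff is quantization. The toy sketch below shows symmetric int8 quantization in pure Python: weights are mapped onto the integer range [-127, 127] with a single scale factor, cutting memory per value at the cost of a small rounding error. It illustrates the general idea only, not NVIDIA's specific optimization pipeline.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto [-127, 127]
    using one per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Worst-case rounding error is bounded by half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing `q` as int8 takes a quarter of the memory of float32 weights, which is exactly the kind of saving that lets a model fit on a Jetson-class device.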
For creating realistic and engaging digital humans, NVIDIA has upgraded its Audio2Face-3D NIM microservice. This service synchronizes audio with facial animation in real time, enabling lifelike interactions. Now offered as an optimized downloadable container, the tool introduces new configuration options for customization. It also includes the inference model used in NVIDIA’s “James” digital human, allowing developers to achieve high-quality facial animation.
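The core idea of driving animation from audio can be shown with a toy mapping: per-frame loudness steers a single "jaw open" blendshape. Audio2Face-3D infers full facial animation with a learned model; this sketch only illustrates the audio-to-animation synchronization concept, and all names in it are invented for illustration.

```python
import math

def jaw_open_curve(samples, frame_size=4):
    """Toy audio-to-animation mapping: per-frame RMS loudness drives a
    single 'jaw open' blendshape value in [0, 1]. Real systems infer
    dozens of blendshapes from learned speech features."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    rms = [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
    peak = max(rms) or 1.0          # avoid division by zero on silence
    return [r / peak for r in rms]  # normalize so the loudest frame = 1.0

# Silence followed by a loud frame: the jaw stays shut, then opens fully.
curve = jaw_open_curve([0.0] * 4 + [0.5, -0.5, 0.5, -0.5])
```

Producing one such value per animation frame, in step with the audio clock, is what keeps lip motion synchronized with speech.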
To streamline the deployment of digital humans, NVIDIA has unveiled new SDK plugins and samples designed for the efficient orchestration of animation, intelligence, and speech AI models. These tools address the complexity of integrating multiple inputs and outputs required for advanced applications. The collection includes NVIDIA Riva for speech-to-text transcription, a Retrieval-Augmented Generation demo, and an Unreal Engine 5 sample application powered by Audio2Face-3D.
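The orchestration problem these plugins address can be sketched as a simple pipeline: audio in, transcript, generated reply, animation out. In the sketch below each stage is a pluggable callable, standing in for real services such as Riva, a RAG responder, and Audio2Face-3D; the function names are illustrative, not NVIDIA SDK APIs.

```python
def run_pipeline(audio, transcribe, answer, animate):
    """Minimal orchestration sketch: chain speech-to-text, a responder,
    and facial animation. Each stage is injected as a callable so real
    microservices could be slotted in behind the same interface."""
    text = transcribe(audio)   # speech-to-text stage (e.g., Riva)
    reply = answer(text)       # language/RAG stage
    frames = animate(reply)    # animation stage (e.g., Audio2Face-3D)
    return {"transcript": text, "reply": reply, "frames": frames}

# Stub stages standing in for the real services.
result = run_pipeline(
    audio=b"...",
    transcribe=lambda a: "hello there",
    answer=lambda t: t.upper(),
    animate=lambda r: [0.2] * len(r.split()),
)
```

Keeping the stages behind uniform callables is what lets a developer swap one model or microservice for another without rewriting the rest of the digital-human application.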
Developers can begin using these SDK plugins and samples today through NVIDIA Developer, making it easier than ever to create intelligent, responsive, and visually compelling digital humans.