Fake It Till You Make It

NVIDIA’s Nemotron-4 340B generates high-quality synthetic text-based data for training LLMs, greatly simplifying data collection efforts.

Nick Bild
Nemotron-4 340B simplifies training LLMs (📷: NVIDIA)

Advances in artificial intelligence (AI) are rapidly changing the world around us, but a number of nagging problems are still slowing our rate of progress. One of those issues is undoubtedly the vast amount of computational resources that many state-of-the-art AI algorithms require. However, there does seem to be a reasonably clear path toward eventually solving this problem. As hardware increases in power and drops in price, these algorithms will become available to larger audiences. Moreover, model optimizations that enable algorithms to execute more efficiently are frequently being developed.

A much more difficult problem to solve, with less obvious solutions, involves data collection. Today’s cutting-edge AI algorithms require vast amounts of training data to gain their knowledge, and collecting the amount of data that is needed, not to mention annotating it, can be a massive undertaking, so much so that it can sink an entire project.

When data collection becomes impractical, developers are increasingly turning to a new approach in which synthetic datasets are produced. As long as this synthetic data accurately captures the variability of real-world data, a model trained on it will function just as well as one trained on real data that was painstakingly collected.

A number of notable tools, like NVIDIA’s Omniverse Replicator, are available for producing synthetic images or 3D scenes. But if you want to train a large language model (LLM), for example, you may still find yourself trying to scrape the text of the entire internet. Needless to say, this is a huge effort, and it is also fraught with copyright concerns. Training LLMs may just have gotten easier, however, with the release of NVIDIA’s Nemotron-4 340B, a family of open models that generates synthetic text-based data.

Nemotron-4 340B includes base, instruct, and reward models that together form a pipeline for generating synthetic data. The high-quality data it produces can be used both to train and to refine LLMs. As part of the NeMo end-to-end platform for developing custom generative AI applications, the data generator is easy to include in any project. And with TensorRT-LLM integration, the production-ready models can be optimized to minimize the computational resources required to run them, cutting costs.
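
In broad strokes, a pipeline like this works by having the instruct model generate candidate responses to prompts, then having the reward model score them so that only the best examples are kept. The sketch below illustrates that generate-then-filter loop; the two helper functions are hypothetical stand-ins for calls to the instruct and reward models, not actual NeMo APIs.

```python
# A minimal sketch of the generate-then-filter loop behind synthetic data
# pipelines like this one. generate_candidates() and score_with_reward_model()
# are hypothetical placeholders for calls to the instruct and reward models.

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Hypothetical: ask the instruct model for n candidate responses."""
    raise NotImplementedError  # replace with a real call to the instruct model

def score_with_reward_model(prompt: str, response: str) -> float:
    """Hypothetical: return an aggregate quality score from the reward model."""
    raise NotImplementedError  # replace with a real call to the reward model

def build_synthetic_dataset(prompts: list[str], per_prompt: int = 4,
                            threshold: float = 3.5) -> list[dict]:
    """Keep only the best-scoring response per prompt, above a quality bar."""
    dataset = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, per_prompt)
        scored = [(score_with_reward_model(prompt, r), r) for r in candidates]
        best_score, best_response = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            dataset.append({"prompt": prompt, "response": best_response})
    return dataset
```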

The Nemotron-4 340B base model was trained on 9 trillion tokens, embedding a vast knowledge of language into it. Accordingly, the pipeline can produce realistic and diverse synthetic data that closely mimics the characteristics of real-world data.

The Nemotron-4 models are now available for download from Hugging Face, so go grab them if you need lots of training data with little effort.
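
As a rough sketch of what grabbing them looks like, the weights can be fetched with the huggingface_hub library. Note that the repo ID below is an assumption based on NVIDIA’s Hugging Face listing, and the full checkpoint weighs in at hundreds of gigabytes.

```python
# Sketch: download a Nemotron-4 340B checkpoint from Hugging Face.
# Requires `pip install huggingface_hub`. The repo ID is assumed to match
# NVIDIA's listing, and the full checkpoint is hundreds of gigabytes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nvidia/Nemotron-4-340B-Instruct")
print(f"Model files downloaded to {local_dir}")
```

From there, the checkpoints can be loaded into NeMo or optimized with TensorRT-LLM as described above.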
