AI2's OLMo 7B Is a "Truly Open Source" Large Language Model for Gen AI — Training Data and All

Released under the Apache 2.0 license, the model includes all weights, pre-training data, training data, and training code.

The Allen Institute for AI (AI2) has released what it claims is a "truly open source" large language model (LLM) and framework: OLMo, described as "state-of-the-art" and made available alongside its pre-training data and training code.

"Many language models today are published with limited transparency. Without having access to training data, researchers cannot scientifically understand how a model is working. It's the equivalent of drug discovery without clinical trials or studying the solar system without a telescope," claims OLMo project lead Hanna Hajishirzi. "With our new framework, researchers will finally be able to study the science of LLMs, which is critical to building the next generation of safe and trustworthy AI."

OLMo 7B, AI2 explains, is built around the organization's Dolma data set and released with model weights for four variants at the seven-billion-parameter scale (hence its name) and one at the one-billion-parameter scale, each of which has been trained on a minimum of two trillion tokens. This puts it on a par with other leading LLMs, and should mean it delivers the same kind of experience: taking an input prompt and delivering a response built from the most statistically likely tokens, which often, but not always, forms an answer that is both coherent and correct.
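
To make "most statistically likely tokens" concrete, here is a minimal sketch of greedy decoding over a toy lookup table. It is an illustration only, not OLMo's code: the `TOY_LOGITS` table, its scores, and the function names are all hypothetical, and a real LLM scores the next token from the whole preceding context and frequently samples rather than always taking the top choice.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical stand-in for a trained model: maps the most recent token
# to raw scores for each candidate next token. Values are made up.
TOY_LOGITS = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"model": 2.5, "data": 1.5},
    "a": {"model": 1.0, "data": 0.5},
    "model": {"answers": 2.0, "<end>": 0.5},
    "data": {"<end>": 1.0},
    "answers": {"<end>": 3.0},
}


def greedy_generate(start="<start>", max_tokens=10):
    """Repeatedly append the single most probable next token."""
    tokens = []
    current = start
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[current])
        current = max(probs, key=probs.get)  # greedy: take the top token
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


print(" ".join(greedy_generate()))  # prints "the model answers"
```

The output is fluent because each step picks a likely continuation, but nothing in the loop checks whether the result is true, which is why such answers are often, but not always, correct.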

AI2 is going beyond releasing the model and its weights, though: it is also making available the pre-training data, full training data, the code to produce said training data, training logs and metrics, more than 500 training checkpoints per model, evaluation code, and fine-tuning code. This, it argues, will provide better precision than its closed-off rivals and avoid the need to perform in-house training, along with the computational demand and carbon output that entails.

"This release is just the beginning for OLMo and the framework," AI2 claims of its launch. "Work is already underway on different model sizes, modalities, datasets, safety measures, and evaluations for the OLMo family. Our goal is to collaboratively build the best open language model in the world, and today we have taken the first step."

More information on the launch is available on the AI2 blog; OLMo itself is available on Hugging Face and GitHub under the permissive Apache 2.0 license.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin.