The LLaMA Is Out of the Bag
Meta AI's LLaMA model, which enables GPT-3-like performance on smaller platforms, has been leaked. Now laptops and Raspberry Pis can run LLMs.
We recently reported on Meta AI’s less-large Large Language Model (LLM) called LLaMA. After recent research suggested that smaller models trained on more data can actually outperform their larger counterparts, Meta AI seized on this opportunity to build a more accessible LLM. Whereas the top-performing LLMs of today tend to have hundreds of billions of model parameters, LLaMA comes in sizes ranging from 7 to 65 billion parameters. But the most important factor in creating these models is that they were trained on between 1 and 1.4 trillion tokens. This proved to yield performance comparable to the big models, but without all the overhead.
And that overhead is massive. The amount of resources needed to run inference with these LLMs is out of reach for all but the largest, and best funded, of organizations. Researchers and hobbyists who could fuel the next revolution in machine learning with access to these models are left out in the cold.
LLaMA promised to change all that, with options that perform as well as GPT-3 models, but can run on as little as a single GPU. However, this innovation came with some restrictions of its own. Meta AI open sourced their model, but in order to get the weights of the model that resulted from training on over a trillion tokens, one would need to be “qualified” and apply for access through a process that specified no criteria for qualification and offered no transparency. Send a request, maybe you get access, maybe you never hear anything ever again. Good luck.
Unless you happen to have a few million dollars burning a hole in your pocket and a good deal of expertise, the model alone is not of much use. The model weights are the critical piece of the puzzle. Fortunately for those champing at the bit to play with LLaMA, it did not take long for the model weights to get leaked. And once they did, some really interesting things happened very quickly.
In a single overnight hacking session, one engineer got LLaMA up and running on a single M1 Apple MacBook Pro. No GPU required — the model runs entirely on the CPU. As you might expect from such a quick hack, it is still a work in progress, but there are enough instructions that someone technically inclined should be able to get it working on their own machine. A 4-bit quantization was performed on the model to pull off this feat, and the consequences of that are not yet fully understood, but demonstrations of the model in action do show that it appears to perform quite well.
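To get a feel for what 4-bit quantization does, here is a minimal sketch of block-wise symmetric quantization, similar in spirit to the scheme used in that hack. The block size of 32 and the exact rounding details are assumptions for illustration, not the project's actual implementation; the core idea is that each small block of weights shares one float scale, so every weight shrinks from 32 bits to roughly 4.

```python
# Sketch of 4-bit block quantization (illustrative; block size is an assumption).
import numpy as np

BLOCK = 32  # weights per block; each block shares a single float scale

def quantize_q4(weights: np.ndarray):
    """Map float weights to 4-bit signed ints in [-8, 7] with a per-block scale."""
    blocks = weights.reshape(-1, BLOCK)
    # Scale so the largest-magnitude weight in each block lands at +/-7
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate floats: each 4-bit int times its block's scale."""
    return (q.astype(np.float32) * scales).ravel()

# Round trip: per-weight error is bounded by half a quantization step
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
max_err = np.abs(w - w_hat).max()
```

The memory savings are what make CPU-only laptops viable hosts: a 7-billion-parameter model drops from roughly 28 GB of 32-bit floats to around 4 GB of packed 4-bit values plus scales, at the cost of the small rounding error demonstrated above.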
The fun did not stop with the MacBook Pro. Other engineers got LLaMA running on Windows machines, a Pixel 6 smartphone, and even a Raspberry Pi. Granted, it runs very slowly on the Raspberry Pi 4, but considering that even a few weeks ago it would have been unthinkable that a GPT-3-class LLM would be running locally on such hardware, it is still a very impressive hack.