This article will guide you through deploying and running popular large language models (LLMs) on the LattePanda 3 Delta 864, including LLaMA, LLaMA2, Phi-2, and ChatGLM2. We will compare these LLMs' inference speed, resource consumption, and model performance to help you choose a model that meets your needs, and to provide a reference for AI research on limited hardware. We will also discuss the key steps and considerations for experiencing and testing LLM performance on the LattePanda 3 Delta 864.
How to Choose LLM
An LLM project usually states its CPU/GPU requirements up front. Since GPU inference for LLMs is not currently available on the LattePanda 3 Delta 864, we need to prioritize models that support CPU inference. Because of the RAM limitation, we should also give preference to smaller models: as a rule of thumb, a model needs roughly twice its file size in RAM to run smoothly. Quantized models have much lower memory demands, so we recommend quantized models for experiencing LLM performance on the LattePanda 3 Delta 864.
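To make that rule of thumb concrete, the sketch below estimates the footprint of a 6B-parameter model at different precisions. The bytes-per-weight figures and the double-the-file-size guideline are rough approximations, not measured values (real GGML files add some overhead for quantization scales and metadata):

# Back-of-the-envelope estimate of LLM memory needs.
# Approximate bytes per weight: fp16 = 2, 8-bit = 1, 4-bit = 0.5.
PARAMS = 6e9  # a 6B-parameter model such as ChatGLM-6B

for name, bytes_per_weight in [("fp16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    file_gb = PARAMS * bytes_per_weight / 1024**3
    # Rule of thumb from the text: ~2x the model size in RAM to run smoothly.
    ram_gb = 2 * file_gb
    print(f"{name}: ~{file_gb:.1f} GB on disk, ~{ram_gb:.1f} GB RAM recommended")

On this estimate, an fp16 6B model needs over 20GB of RAM, while only the 4-bit version (about 2.8GB on disk, roughly 5.6GB of RAM) fits comfortably within the Delta 864's 8GB, which is why we use a 4-bit model below.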
The following list is a selection of smaller models from the open_llm_leaderboard on the Hugging Face website, together with the latest popular models.
P.S. The leaderboard scores models on four benchmarks:
1. ARC (AI2 Reasoning Challenge)
2. HellaSwag (testing the model's common-sense reasoning abilities)
3. MMLU (Measuring Massive Multitask Language Understanding)
4. TruthfulQA (measuring how models mimic human falsehoods)
How to run LLM
We used LLaMA.cpp-style GGML runtimes (here, chatglm.cpp) on the CPU of the LattePanda 3 Delta 864 to run LLM inference. Below, we take ChatGLM-6B as an example and provide detailed instructions for deploying and running an LLM on the LattePanda 3 Delta 864, which has 8GB RAM and 64GB eMMC and runs Ubuntu 20.04.
Quantization
The following is the process of quantizing ChatGLM2-6B to 4-bit via GGML on a Linux PC:
The first stage of the process is to set up ChatGLM.cpp on a Linux PC, download the ChatGLM-6B-int4 model, convert it to GGML format, and copy the result to a USB drive. We need the Linux PC's extra power for the conversion because the Delta 864's 8GB of RAM is insufficient.
Clone the ChatGLM.cpp repository to your local machine:
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:

git submodule update --init --recursive

Install the necessary Python packages:

python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece

Compile the project using CMake:
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release

Pin transformers to version 4.33.0 for the conversion step:

pip uninstall transformers
pip install transformers==4.33.0

Download the model and its other files to chatglm.cpp/THUDM/chatglm-6b: https://huggingface.co/THUDM/chatglm-6b-int4
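If you prefer to script the download instead of fetching files by hand from the model page, a minimal sketch using the huggingface_hub package (an assumption on our part; any download method that places the files in the expected folder works):

# Requires: python3 -m pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the ChatGLM-6B-int4 files into the folder convert.py expects.
snapshot_download(
    repo_id="THUDM/chatglm-6b-int4",
    local_dir="THUDM/chatglm-6b",  # path used by the convert command below
)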
Use convert.py to transform ChatGLM-6B into quantized GGML format. For example, to convert the fp16 original model to a q4_0 (4-bit quantized) GGML model, run:
python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin

Copy the resulting chatglm-ggml.bin to a USB drive so you can transfer it to the board.

Model Deployment
Here is the process of deploying and running ChatGLM-6B-q4 on the LattePanda 3 Delta 864 under Ubuntu 20.04:
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
git submodule update --init --recursive
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release
pip uninstall transformers
pip install transformers==4.33.0

Copy chatglm-ggml.bin from the USB drive into the chatglm.cpp folder. To run the model in interactive mode, add the -i flag. For example:
cd chatglm.cpp
./build/bin/main -m chatglm-ggml.bin -i

In interactive mode, your chat history serves as the context for the next round of conversation.
Run ./build/bin/main -h to explore more options!
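chatglm.cpp also ships optional Python bindings, which can be handy for scripting repeatable tests on the board. A minimal sketch, assuming the chatglm-cpp package (pip install chatglm-cpp) and its Pipeline API; the interface has changed between releases, so check the project README for your version:

# Scripted chat with the quantized model via chatglm.cpp's optional
# Python bindings (API assumed from the project README; verify it
# against the chatglm-cpp version you installed).
import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("./chatglm-ggml.bin")
messages = [chatglm_cpp.ChatMessage(role="user", content="Hello! Who are you?")]
reply = pipeline.chat(messages)
print(reply.content)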