While a variety of technical and programming copilots now exist, they are typically cloud-based, carry usage or subscription costs, and raise privacy concerns. This project seeks to build a local, customizable LLM agent focused on current AMD technologies (with emphasis on GPUs) and on python with related libraries. It answers technical questions in English and provides python coding suggestions. While the project used AMD's AAC cloud for development, the finetuned model can run in about 16GB of VRAM on a local GPU that supports ROCm and be accessed from a development computer or within a local network. This keeps development and code queries private when, for example, the code cannot be open-sourced. The main goal is to increase productivity and reduce learning time for python development in on-premise environments that use AMD technologies.
Approach
Current cloud-based language models typically have knowledge cutoffs that are months in the past. For example, the knowledge cutoff of Gemini 1.5 Pro is November 2023, and that of GPT-4o is October 2023. Even then, most training data on a particular subject or technology is statistically centered earlier than the cutoff, so query responses may reflect dated knowledge. This is an issue when querying an LLM on subjects that advance rapidly, as AMD technologies do. One solution is to finetune an LLM with more current knowledge.
To do this, a knowledge-distillation approach was taken, in which a larger, more powerful teacher model is used to train a smaller, less powerful student model on specific knowledge, in this case AMD technologies and python. This involved several steps: generating related questions, building a corpus of current documentation, answering the questions to build a training set, finetuning the LLM, and analyzing responses to questions both with and without finetuning.
Methods
A corpus of more recent documentation was constructed for use in a retrieval-augmented generation (RAG) implementation. At query time, relevant passages are retrieved from a vector store (FAISS in this case) and supplied to the larger teacher model as additional context for answering questions. The corpus consisted of a subset of publicly available AMD and related open-source documentation, including PDFs, websites, blogs, press releases (for 2023 and 2024) and GitHub repositories. Lists of the sources used can be found in the corpus directory of the GitHub repository.
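As an illustration, the retrieval step might look like the following minimal sketch. The embedding model, flat FAISS index, and pre-chunked passages here are assumptions for demonstration, not confirmed choices of the project.

```python
# Minimal RAG retrieval sketch: embed documentation chunks, index them with
# FAISS, and fetch the most relevant chunks as context for the teacher model.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
chunks = [
    "ROCm 6 documentation passage ...",   # placeholder corpus chunks
    "Instinct MI300 series passage ...",
]
embeddings = encoder.encode(chunks, normalize_embeddings=True)

# Inner product over normalized embeddings is cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype=np.float32))

def retrieve_context(question: str, k: int = 2) -> str:
    """Return the top-k most relevant corpus chunks for a question."""
    q = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return "\n\n".join(chunks[i] for i in ids[0])

# The retrieved text is prepended to the question sent to the teacher model.
context = retrieve_context("Which GPUs does ROCm 6 support?")
```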
To create a training set for finetuning, questions on related topics and python coding were requested from both Google Gemini and ChatGPT-4o. In all, 1111 technology questions and 547 coding questions were generated. These questions were then answered by Google Gemini 1.5 Pro via its API, with RAG supplying supporting context. The questions and answers, saved as CSV files, were converted into two Alpaca-format training sets of instruction-output pairs.
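The CSV-to-Alpaca conversion is straightforward; a sketch follows, with the column names ("question", "answer") and file names assumed for illustration.

```python
# Sketch: convert a Q&A CSV into an Alpaca-style instruction/output training set.
import csv
import json

records = []
with open("qa_pairs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        records.append({
            "instruction": row["question"],  # assumed column name
            "input": "",                     # no separate input for these pairs
            "output": row["answer"],         # assumed column name
        })

with open("alpaca_train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```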
Llama 3.1 8B Instruct was selected as the base model, owing to its relatively small size and recent release. Torchtune was used for full finetuning on a single Instinct MI210 GPU. The 1658 questions and answers were presented for 5 epochs, with diminishing losses shown below in Figure 1. GPU VRAM usage peaked at about 52 GB.
Figure 1: Training loss over 5 epochs of full finetuning.
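For reference, a torchtune full-finetune run of this kind could be launched roughly as follows. The recipe and config names match torchtune's documented Llama 3.1 recipes, but the paths and dataset overrides are assumptions, not the project's actual invocation.

```bash
# Sketch of a torchtune full finetune on a single GPU (paths and dataset
# overrides are assumed for illustration).
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
    --output-dir /models/Llama-3.1-8B-Instruct

tune run full_finetune_single_device \
    --config llama3_1/8B_full_single_device \
    dataset._component_=torchtune.datasets.alpaca_dataset \
    dataset.source=json \
    dataset.data_files=alpaca_train.json \
    epochs=5
```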
Additional questions for testing were generated using an OpenAI GPT, which was given the previously generated questions as context and asked to produce related but new and unique questions. This test set comprised 150 AMD technology questions and 60 python coding questions. Both the training questions and the test questions were presented to the original Llama3_1-8B-Instruct model and to the finetuned Llama3_1-8B-Instruct-AMD-python model for comparison.
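A simple way to gather such side-by-side responses is sketched below, using the transformers pipeline API; the model IDs follow the names above, while the example question and generation settings are illustrative.

```python
# Sketch: collect side-by-side answers from the base and finetuned models.
from transformers import pipeline

questions = ["Which AMD Instinct GPUs does ROCm 6 support?"]  # example question

rows = []
for model_id in ("meta-llama/Meta-Llama-3.1-8B-Instruct",
                 "davidsi/Llama3_1-8B-Instruct-AMD-python"):
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    for q in questions:
        out = generator([{"role": "user", "content": q}], max_new_tokens=512)
        # The pipeline returns the chat with the new assistant turn appended.
        rows.append((model_id, q, out[0]["generated_text"][-1]["content"]))
```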
Analysis
The original Llama 3.1 model does quite well on responses and python code examples. However, the finetuned model tends to produce more specific and more accurate, though briefer, responses. The brevity is likely due at least in part to using RAG with the teacher model: Google Gemini, for one, can constrain its responses to the RAG context, probably to avoid hallucinations. In particular, the original model may refer to older Instinct GPU hardware such as the MI8, or older ROCm releases such as 4.0, in responses about current technologies, whereas the finetuned model is more likely to cite current hardware and software, such as the MI300 series or ROCm 6. The python code produced also sometimes seems more current, but detailed analysis is required to verify both its accuracy and its successful execution on AMD and other hardware.
Complete side-by-side comparisons in CSV spreadsheets are available in the responses directory of the GitHub repository.
Applications and Future Directions
Queries can be made accessible via a local browser-based chat interface, a local API service, or from within PyCharm, for example. A solution can include a RAG implementation with a local repository for adding new documents and code. A well-curated training set can improve performance. Additionally, a dedicated benchmark of questions and answers on AMD technologies and python, modeled on HumanEval or similar, would be useful for testing and identifying possible improvements.
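As one possible shape for the local API service mentioned above, a minimal sketch follows; FastAPI is an assumed choice here, and the endpoint name and payload are illustrative only.

```python
# Minimal local API sketch serving the finetuned model (FastAPI assumed).
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline(
    "text-generation",
    model="davidsi/Llama3_1-8B-Instruct-AMD-python",
    device_map="auto",
)

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    out = generator([{"role": "user", "content": query.question}],
                    max_new_tokens=512)
    return {"answer": out[0]["generated_text"][-1]["content"]}

# Run locally with, e.g.: uvicorn server:app --host 0.0.0.0 --port 8000
```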
Model availability
The finetuned LLM on Hugging Face: davidsi/Llama3_1-8B-Instruct-AMD-python
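A quick-start sketch for loading the model locally is shown below; it assumes a ROCm- (or CUDA-) enabled PyTorch build, and the prompt and generation settings are illustrative. In bfloat16 the 8B weights occupy roughly 16GB, consistent with the VRAM requirement noted above.

```python
# Quick-start sketch: run the finetuned model locally with Hugging Face tooling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davidsi/Llama3_1-8B-Instruct-AMD-python"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "How can I verify ROCm sees my GPU from python?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```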