Guest Post: How I reached 95.8% accuracy on factual data with Lamini and Llama 3.1
This is a guest blog post by Allan Ray Jasa.
In this post, I share my journey from AI novice to achieving impressive accuracy with a fine-tuned LLM in a short period of time. Using Lamini and Llama 3.1, I explored the process of improving an LLM's performance on a book’s first chapter.
When Meta released Llama 3.1, I got so excited. Finally, I thought, there’s an open source LLM that matches the capabilities of GPT-4. That had seemed increasingly out of reach as the months after GPT-4’s initial release stretched past a year. Llama 3.1 is so impressive that even the smaller versions, such as Llama-3.1-8B, appear to perform on par with GPT-4o in adherence to tool/function calling.
After finishing the DeepLearning.AI course Improving Accuracy of LLM Applications, in which Sharon Zhou of Lamini teaches how to fine-tune a Llama 3.1 model on the Lamini platform, I decided to get my hands dirty with this LLM fine-tuning business.
To be completely honest, I had tried fine-tuning an LLM before and thought it was not for me. I had taken an online course on fine-tuning a FLAN-T5 model, and the Jupyter notebook looked like esoteric incantations to a nameless, faceless LLM deity: what do you mean I need to set the target modules to “q” and “v” in the LoRA config? Why should I set auto_find_batch_size to True in the training arguments? With so much code and so many parameters whose purpose I didn’t understand, I felt intimidated.
I thought fine-tuning an LLM was a task better suited to data scientists with PhDs, or research engineers at OpenAI or Anthropic. As an iOS developer trying to pivot to an AI career, I had resigned myself to the idea that all I could do was funnel data to an LLM, as in a RAG application, and then present the result of inference back to the user. So imagine my surprise when, in the DeepLearning.AI course I mentioned above, Sharon demonstrated that with Lamini you can fine-tune a state-of-the-art LLM such as Llama 3.1 with just two to three lines of code.
We will get to the details as we go along, but as proof, this is the final code that gave me a fine-tuned Llama 3.1-8B-Instruct that achieved 95.8% accuracy, in three lines.
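The call itself looks roughly like this. This is a sketch based on Lamini’s Python client as documented around that time; the exact method and parameter names may vary between client versions, and `qa_pairs` is the question-and-answer training data described below.

```python
# Sketch of the three-line fine-tuning call with Lamini's Python client.
# The tuning method is `tune` in recent client versions (`train` in older ones),
# and the keyword name follows the docs I used; both may differ in your version.
from lamini import Lamini

llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")
llm.tune(data_or_dataset_id=qa_pairs)  # qa_pairs: list of {"input": ..., "output": ...}
```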
So what did I do? I chose a different problem to work on. The DeepLearning.AI course focused on fine-tuning an LLM to give accurate SQL statements from natural language. Since the documentation mentioned that Lamini can handle facts from texts, I thought of giving that a try.
I’ve always admired this particular writer and I have a copy of his biography on my shelf. So I thought, why not train the model on the first chapter of this book and see if it can answer my questions? I extracted the text from the first chapter and fed it directly to the model.
Of course, since I am quite new to this, I initially didn’t realize that fine-tuning an LLM with Lamini requires question-and-answer pairs, even when the answers are facts rather than SQL statements. This requirement is common in many current LLM fine-tuning approaches. I appreciated Lamini’s descriptive error message, which helped me arrive at that conclusion.
Taking advantage of Anthropic Claude’s very wide context window, I asked it to generate about 100 questions from the book’s first chapter, and then, using the sample guide from the documentation, I trained the model, which was Llama 3.1-8B-Instruct.
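For reference, the training data I passed to Lamini was shaped roughly like this. The input/output keys and the Llama 3.1 chat template below follow the pattern from the course and the docs as I remember them, so treat the details as illustrative rather than exact:

```python
# Illustrative: converting the Claude-generated Q&A pairs into the
# input/output records Lamini expects, with each question wrapped
# in the Llama 3.1 instruct chat template.
def make_example(question: str, answer: str) -> dict:
    prompt = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    return {"input": prompt, "output": answer + "<|eot_id|>"}

# claude_generated_pairs: ~100 (question, answer) tuples drawn from chapter one
qa_pairs = [make_example(q, a) for q, a in claude_generated_pairs]
```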
I thought the fine-tuned model was giving good, although brief, answers until I built a RAG application over the same text from the book, powered by Llama 3.1 (running via Ollama on a MacBook Air).
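I won’t walk through the RAG code in detail, but the setup was the usual one: chunk the chapter, embed the chunks, retrieve the most similar chunks for a question, and let Llama 3.1 answer from them. A minimal sketch of that kind of app follows; the embedding model, chunk sizes, and prompts here are illustrative choices, not necessarily the ones I used.

```python
# Minimal RAG sketch over the chapter text, using models served locally by Ollama.
import ollama
import numpy as np

# chapter_text holds the extracted first-chapter text from earlier.
def embed(text: str) -> np.ndarray:
    resp = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return np.array(resp["embedding"])

# Split the chapter into overlapping chunks and embed each one.
chunks = [chapter_text[i:i + 1000] for i in range(0, len(chapter_text), 800)]
chunk_vectors = np.array([embed(c) for c in chunks])

def rag_answer(question: str, k: int = 3) -> str:
    q_vec = embed(question)
    # Cosine similarity between the question and every chunk.
    sims = chunk_vectors @ q_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    resp = ollama.chat(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["message"]["content"]
```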
I compared the answers and saw what was missing: the RAG application was able to include more detail in its answers than the fine-tuned LLM. And it’s not as if those details weren’t available in the fine-tuning data; they were all there. The model just wasn’t able to connect the dots, or generalize and integrate the data it was trained on.
Using Claude again, I asked it to generate about 49 new Q&A pairs that generalize information from the 100 previously generated pairs. I combined the two sets and, as instructed by Lamini’s documentation, multiplied the result by ten before using it as training data.
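In code, that combination and duplication is a one-liner (the variable names here are mine):

```python
# ~100 base pairs + ~49 generalization pairs, repeated 10x per Lamini's docs.
training_data = (base_pairs + generalization_pairs) * 10
```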
With a bit of prompt engineering, I was able to achieve something similar to the RAG application, or at least I thought I did, simply by asking the same question as earlier and comparing the answer to the RAG application’s answer.
But of course I needed some quantitative measurement. I needed to know, overall, how this fine-tuned model performed compared to the RAG application. So I asked Claude again to generate 25 questions from the text, adding that the answers should not be reducible to a name, place, date, or any other simple fact. For example, this pair is not valid:
Q: What is the author’s brother’s name? A: The author’s brother’s name is Allan.
Because it can be reduced to just Allan. This is a good one:
Q: Describe the historical context surrounding the author’s birth. A: [a sentence or two]
Because the model has to take the author’s birth date and consider what other historical events were happening around that time, information that was also included in the fine-tuning data.
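To generate the set, I gave Claude those constraints. A rough sketch of the request, using the Anthropic Python client (the model name and prompt wording are illustrative, not my exact prompt):

```python
# Illustrative: asking Claude for 25 evaluation questions whose answers
# cannot be reduced to a single name, place, date, or other simple fact.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": (
            "Here is the first chapter of a biography:\n\n" + chapter_text +
            "\n\nGenerate 25 question-answer pairs about this text. Each answer "
            "must take a sentence or two to state and must NOT be reducible to "
            "a single name, place, date, or other simple fact."
        ),
    }],
)
gold_standard_text = message.content[0].text
```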
This set of questions served as a gold standard for evaluation: questions the model hadn’t seen before. To run the evaluation, I created a sort of scorer powered by OpenAI’s GPT-4o.
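The scorer essentially shows GPT-4o the question, the gold-standard answer, and a model’s answer, and asks it to judge whether the answer is factually correct. A sketch of that scorer, using the OpenAI Python client (the prompt wording here is approximate, not my exact prompt):

```python
# Sketch of the GPT-4o scorer; the actual prompt wording differed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCORER_PROMPT = """You are grading answers about the first chapter of a biography.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with exactly one word: CORRECT if the candidate answer is factually
consistent with the reference answer, otherwise INCORRECT."""

def score(question: str, reference: str, candidate: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": SCORER_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```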
And then I evaluated: I went through all the gold-standard questions, asked both the fine-tuned model and the RAG application the same question, had the scorer evaluate each of the answers, and collected the results.
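In outline, the loop looked something like this. Here `gold_standard` is the list of evaluation Q&A pairs, `finetuned_llm.generate` stands in for the Lamini client call to the tuned model (the method name follows my recollection of the docs), and `rag_answer` is the RAG function sketched earlier:

```python
# Sketch of the evaluation loop over the gold-standard questions.
results = {"finetuned": 0, "rag": 0}

for item in gold_standard:
    q, ref = item["question"], item["answer"]
    if score(q, ref, finetuned_llm.generate(q)):
        results["finetuned"] += 1
    if score(q, ref, rag_answer(q)):
        results["rag"] += 1

total = len(gold_standard)
print(f"Fine-tuned accuracy: {results['finetuned'] / total:.1%}")
print(f"RAG accuracy:        {results['rag'] / total:.1%}")
```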
For the second iteration, the two came out essentially tied: the fine-tuned model and the RAG application scored the same accuracy on the gold-standard questions.
My initial thought was, why bother fine-tuning an LLM when it achieves the same accuracy as a RAG application? But then I inspected the evaluation and found that there was only one question on which both the fine-tuned model and the RAG application failed. On the others, either the fine-tuned model was correct or the RAG application was correct, which led me to believe that the generalizations in the fine-tuning data probably weren’t enough, since the model couldn’t infer answers in those areas.
For my third iteration, I thought of adding an equal number of generalization Q&A pairs to the 100 base Q&A pairs. I had assumed Claude had generated a hundred of them, because I kept asking it to continue whenever it got cut off by the response limit, but it turned out it had only generated 84 new Q&A pairs. Maybe that was the extent to which 100 facts could be generalized.
I trained the base model again with that data, and this was my final result: the fine-tuned model reached 95.8% accuracy.
I was so surprised that I was able to increase its performance. Bear in mind that, as of this writing, Lamini doesn’t support retraining fine-tuned models, so all I adjusted was the training data; the starting base model was the same in every iteration.
This journey, from knowing nothing about fine-tuning to achieving 95.8% accuracy in just two days, has been both exciting and enlightening. I am so impressed with what I have accomplished with Lamini. Reflecting on this experience, I have come to appreciate several valuable lessons:
- Accessibility of AI technology: Fine-tuning state-of-the-art language models is no longer limited to PhD-level data scientists. With the right tools, even those new to the field like me can achieve impressive results.
- Iterative improvement: Investigating what to adjust with each round of fine-tuning, and studying the corresponding evaluation, provided insights that guided the next steps.
- Quality of training data: The importance of high-quality, diverse training data cannot be overstated. Generalizing the information and creating thoughtful question-answer pairs significantly boosted the model's performance.
- Simplified fine-tuning process: Thanks to the simplicity of Lamini's API, the complexity of parameter adjustment is greatly reduced. With only 2-3 key parameters to consider (in my case), users can focus primarily on studying evaluation results and creating quality data, making the fine-tuning process more approachable and efficient.
- Comparison with RAG: While RAG systems are powerful, fine-tuned models can achieve comparable or even superior results in specific domains. The choice between these approaches depends on the use case and available resources.
- Potential for specialized models: This experiment demonstrates the potential for creating highly accurate, domain-specific models that can outperform general-purpose LLMs in niche areas.
When I look ahead, I see several exciting avenues to explore:
- Applying this fine-tuning approach to other domains, models, or larger datasets
- Investigating ways to combine fine-tuned models with RAG systems for optimal performance
As AI technology continues to evolve, the ability to create custom, highly accurate models will likely become an increasingly valuable skill. This project has shown me that with perseverance, creativity, and the right tools, impressive results are within reach for AI enthusiasts at all levels.