How a Fortune 500 slashed hallucinations to create 94.7% accurate LLM agents for SQL

Lamini helped one of our Fortune 500 customers reach 94.7% LLM accuracy with their Business Intelligence agent for a text-to-SQL use case. How? Using Lamini Memory Tuning on the open-source Mistral 2 model with the customer’s SQL schema. Read on for a deep dive into the challenge this customer faced, and a walkthrough of how to use Lamini Memory Tuning to achieve unprecedented performance.

The Challenge

Answering natural language queries from analysts with live data is an enticing use case for LLMs: empowering users across the organization to make data-driven decisions more quickly, without being bottlenecked on a data science team (who would rather be doing actual data science than building ad hoc dashboards), is clearly valuable. But it’s also very difficult to get from an exciting demo to a robust, reliable application your teams can depend on for critical business decisions.

This particular customer reached out to us after multiple internal teams had spent months on the problem — advanced RAG and prompting techniques topped out at 50% accuracy. Accuracy here means their agent successfully:

  1. Turned the initial natural language query into SQL that was both syntactically valid AND semantically correct for their complex data schema,
  2. Retrieved live data using that query, and
  3. Used the data to answer the initial query in natural language.

The problem was in that first step: earlier approaches either hallucinated columns that weren’t present in the schema, so the query wouldn’t execute at all, or produced SQL that ran but returned meaningless results, because the generated query didn’t match the intent of the question to the non-intuitive nuances of the database schema.

Coin flip accuracy clearly wasn’t good enough.

Lamini Memory Tuning's Impact

“Fine-tuning doesn’t have a lot of docs or best practices. It's a lot of trial and error, so it takes weeks to finetune a model. With Lamini, I was shocked — it was 2 hours.” — Fortune 500 engineering leader

The bottom line: Lamini Memory Tuning achieved 94.7% accuracy on this use case. But that’s not all that matters.

  • After just a single week of onboarding, the customer’s dev team could iterate independently and achieve these results.
  • A Lamini Memory Tuning run clocks in at under two hours (and at Lamini, we’re working on getting it down to minutes!).
  • The customer was able to move quickly because no advance data preparation or manual labeling was needed: just a SQL schema to get started.
  • Context windows aren’t an issue, because Lamini Memory Tuning can recall billions of tokens of data.
  • The Memory Tuned model is composable with existing RAG and prompt-tuning workflows and libraries.
  • Lamini Memory Tuning creates future-proof models: when the next open-source LLM ships with better general reasoning capabilities (but likely the same memory issues), this customer will be able to automatically upgrade that LLM with the same Memory Tuning infrastructure they’ve already built.

These results are exciting, and it’s still early days. In this post, you’ll see how you can apply the same techniques to break the LLM accuracy barrier and achieve 95-99% accuracy for your own LLM code applications.

Lamini Memory Tuning Walkthrough

You’ll walk through how to Memory Tune a Business Intelligence (BI) agent to support text-to-SQL tasks for business analyst teams. The workflow is straightforward (a minimal code sketch of it follows the list):

  1. The user asks a question
  2. A 1st LLM generates SQL from the user query
  3. A SQL engine executes the generated SQL
  4. Finally, a 2nd LLM generates a response for the user based on the executed SQL output
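
Here is a minimal sketch of that loop in Python. The llm_generate() call is a placeholder for whichever inference client you use (e.g. Lamini Inference), SQLite stands in for your real SQL engine, and the prompt wording is illustrative rather than the customer’s actual implementation.

```python
# Minimal sketch of the two-LLM BI agent loop described above.
# llm_generate() is a placeholder for your inference client (e.g. Lamini
# Inference); sqlite3 stands in for the real SQL engine.
import sqlite3

def llm_generate(prompt: str) -> str:
    """Placeholder: call your LLM inference endpoint here."""
    raise NotImplementedError

def bi_agent(user_question: str, conn: sqlite3.Connection, schema: str) -> str:
    # Steps 1-2: the 1st LLM turns the user's question into SQL, grounded in the schema.
    sql = llm_generate(
        f"Schema:\n{schema}\n\nWrite one SQL query that answers: {user_question}\nSQL:"
    )
    # Step 3: the SQL engine executes the generated query.
    rows = conn.execute(sql).fetchall()
    # Step 4: the 2nd LLM turns the raw rows into a natural-language answer.
    return llm_generate(
        f"Question: {user_question}\nQuery result: {rows}\nAnswer the question using this result:"
    )
```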

Our Approach

AI development is highly iterative, so starting small pays huge dividends. Before you get started, find the smallest scope that matters and expand it later. In this case, downscoping meant successfully reaching over 90% accuracy on a single table for this BI agent workflow.

Pipeline with Lamini Memory Tuning

With the initial scope in hand, start by setting up a reproducible, scalable baseline. This involves running the baseline BI agent, curating a small set of user-query-to-SQL examples (~10) that need to work (but don’t—yet!), and then running an LLM evaluation agent over them (e.g. with Lamini Inference). Initial performance will be poor — that’s expected. The goal is to identify where the LLM needs the most improvement.

Next, diagnose which part of the agent is failing: in this case, it looks like it’s the 1st LLM that keeps hallucinating columns and values in the table. Set up a data generation agent (e.g. with Lamini Inference) to tune the 1st LLM with Lamini Memory Tuning on the table schema. Within a day, you can have the model running at nearly 95% accuracy!

The resulting BI agent will look very similar: the only difference is that LLM 1 is now Memory Tuned and able to generate accurate SQL consistently.

Below is the step-by-step approach.

Step 1: Diagnose hallucinations in the agent workflow (baseline)

Starting pipeline + Diagnosing problems

15 minutes. First, confirm that you have a hallucination problem. In this case, when the user asked “Can you filter cost data for April?”, the column value was written in the wrong format, so the response returned 0 rows from the table. This is particularly insidious because the SQL still executes successfully — it’s syntactically valid, but semantically wrong. What’s worse, in many enterprises, data formats and content are often confusing or arbitrary. For example, “042024” isn’t too hard to read as April 2024, but what if the fiscal year of that table, which you inherited from an M&A deal, started in a different month? Getting this right can be quite challenging.
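
To make this failure mode concrete, here is an illustrative reproduction in Python with SQLite. The table and column names (cost_data, fiscal_month) are assumptions for the example, not the customer’s schema; both queries execute, but only the one grounded in the table’s actual value format returns data.

```python
# Illustration of a silently wrong query: valid SQL, zero rows returned.
# Table and column names are assumed for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cost_data (fiscal_month TEXT, cost REAL)")
conn.execute("INSERT INTO cost_data VALUES ('042024', 1250.0)")

# Hallucinated value format: executes fine, but matches nothing.
bad = conn.execute(
    "SELECT SUM(cost) FROM cost_data WHERE fiscal_month = '2024-04'"
).fetchone()

# Value format grounded in how the data is actually stored.
good = conn.execute(
    "SELECT SUM(cost) FROM cost_data WHERE fiscal_month = '042024'"
).fetchone()

print(bad, good)  # (None,) (1250.0,)
```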

If the results all look good, then dial up the difficulty. You could call this an “adversarial playground”: a place to play with your agent or LLM on reasonable requests, making it harder and harder as well as less and less reasonable until it breaks.

Step 2: Curate the easiest examples that still break (evaluation)

1 hour. Record what doesn’t work. The goal is to get to a small set of the easiest examples that still break. This means the task is within reach but not yet reached, and you can make your LLM smarter on these before reaching for the next level of tasks.

How small is small? Start with as few as 10: your goal is to see results fast, then start the next iteration. This is your evaluation dataset. To create an evaluation agent that can automatically run and score LLM results on your dataset, read more on how we did it and run the code. This makes it easily reproducible across your team and scalable for your next (harder) iteration!
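
As a rough sketch of what such an evaluation agent can boil down to, the loop below executes both the reference SQL and the generated SQL and counts a hit when the result sets match. The dataset fields and helper names are illustrative, not the linked code.

```python
# Sketch of an evaluation loop: a generated query is scored correct when it
# executes and returns the same result set as the reference (gold) query.
import sqlite3

def run_sql(conn: sqlite3.Connection, sql: str):
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # malformed SQL counts as a miss

def evaluate(conn, eval_set, generate_sql) -> float:
    correct = 0
    for example in eval_set:  # each example: {"question": ..., "gold_sql": ...}
        gold = run_sql(conn, example["gold_sql"])
        pred = run_sql(conn, generate_sql(example["question"]))
        if pred is not None and pred == gold:
            correct += 1
    return correct / len(eval_set)
```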

In this case, the customer curated 19 examples in their evaluation dataset, of which the BI agent scored five correctly, for 26% accuracy. You can even start from 0% accuracy: a non-zero starting accuracy lets you check backwards compatibility on the examples that were already correct at baseline, but it’s not required.

Finally, triage where the hallucinations are coming from within your agent. Which LLM call is failing? This is where running your evaluation agent is useful for quantifying where things are going awry. In this case, the 1st LLM kept hallucinating columns and values in the table, so the SQL queries it generated were either malformed or failed silently, returning the wrong results to the 2nd LLM. The 2nd LLM was able to return a good response to the user, closely following the output of the executed SQL query as it related to the user query in the prompt. The focus was therefore on improving the 1st LLM in order to improve the BI agent.

Step 3: Prepare data using LLM data agents on hallucination examples

Data Prep with Data Agents

2 hours. Now that you know to focus on the 1st LLM, look at what the 1st LLM takes in as its prompt and outputs. For Lamini Memory Tuning to work, you need to show the 1st LLM examples of SQL queries covering the breadth of your schema as well as examples of user queries that you expect would be asked.

Sure, in the old days, you could manually collect and label a dataset of these SQL-user-query pairs…but this is the LLM era! LLM data agents can do that for you. One agent inspects the table schema and content to write SQL queries. The other agent hypothesizes analyst questions that could be answered using the data returned from the generated SQL queries.

The trick is to work backwards from what you were generating before with the 1st LLM. So, you’ll want a data agent to first generate SQL queries, then another data agent to take those SQL queries and generate user queries.

Since you want the SQL queries to be grounded in your schema (the whole point!), you can prompt and RAG an LLM data agent to ingest part of your schema to generate SQL queries based on that part. Two key points here:

  • Prompt-tuning: The key is to take in only a part of it at a time to ensure it focuses on that part of the schema and stays accurate in its generation. Giving it too much work can make it hallucinate—and trust us, you don’t want hallucinations all the way down.
    • Schema representation: In the prompt, the schema can be expressed as a string in many ways. Experimenting with this to get the best SQL query outputs makes your data agent more effective (see the sketch after this list).
  • Hard examples: The key is to make sure the SQL queries resemble the type of queries that your LLM was getting wrong before. What’s the point of generating SQL that’s too easy? Thankfully, you made a handy dandy evaluation dataset with SQL queries that were too hard! Simply give the LLM data agent relevant examples of SQL queries from this evaluation dataset.
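
Here is one hedged way to put those two points together when prompting the SQL-generating data agent. The schema string format is exactly the thing to experiment with; the helper names, table, and columns below are illustrative assumptions.

```python
# Sketch: represent a slice of the schema as a string (with an example row),
# plus a few hard queries from the evaluation dataset as few-shot examples.
def schema_to_string(table: str, columns: dict, example_row: dict) -> str:
    cols = "\n".join(f"  {name} {col_type}" for name, col_type in columns.items())
    example = ", ".join(f"{k}={v!r}" for k, v in example_row.items())
    return f"Table {table}:\n{cols}\nExample row: {example}"

def sql_generation_prompt(schema_str: str, hard_sql_examples: list) -> str:
    shots = "\n".join(f"- {sql}" for sql in hard_sql_examples)
    return (
        f"{schema_str}\n\n"
        f"Queries analysts found hard against this table:\n{shots}\n\n"
        "Write 5 new, similarly difficult SQL queries against this table.\n"
    )

print(sql_generation_prompt(
    schema_to_string(
        "cost_data",
        {"fiscal_month": "TEXT", "cost": "REAL"},
        {"fiscal_month": "042024", "cost": 1250.0},
    ),
    ["SELECT SUM(cost) FROM cost_data WHERE fiscal_month = '042024'"],
))
```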

Not all generated SQL queries will be high-quality. One simple filter is to execute the SQL queries against your SQL engine, only keeping the ones that return a result. If you want more filtering tricks, you can create an LLM data filter agent to do that automatically.
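
That filter can be as simple as the sketch below, with SQLite again standing in for your engine; anything more elaborate can be delegated to an LLM filter agent.

```python
# Keep only generated queries that execute and return at least one row.
import sqlite3

def filter_generated_sql(conn: sqlite3.Connection, generated_sql: list) -> list:
    kept = []
    for sql in generated_sql:
        try:
            if conn.execute(sql).fetchall():  # non-empty result set
                kept.append(sql)
        except sqlite3.Error:
            pass  # syntactically invalid SQL: drop it
    return kept
```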

Next, take your generated SQL queries and their results from the SQL engine, and use those in the prompt for the second LLM data agent. Following a similar process, generate user queries that those SQL queries can answer. Be sure to leverage your evaluation dataset again, and give the LLM data agent relevant examples of user queries from it that were too hard.
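
A sketch of the second data agent’s prompt might look like this, again with illustrative helper names: it works backwards from a kept SQL query and its executed result to the analyst question it answers.

```python
# Sketch: prompt the second data agent to hypothesize the user question that a
# generated (and executed) SQL query answers.
def question_generation_prompt(sql: str, rows, hard_user_queries: list) -> str:
    shots = "\n".join(f"- {q}" for q in hard_user_queries)
    return (
        f"SQL query:\n{sql}\n\n"
        f"Executed result rows: {rows}\n\n"
        f"Analysts tend to ask questions like:\n{shots}\n\n"
        "Write the natural-language question this query answers.\n"
    )
```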

How much data do you need to generate? A good rule of thumb is about 1,000 data pairs (SQL query and user query pairs in this case).

What’s the upper limit? The same amount of data the model was originally pre-trained on, so (almost) no limit! For reference, Llama 3 was pre-trained on 15 trillion tokens; you can go far beyond the limits of any context window (e.g. Gemini’s is a mere 1 million tokens). Not only that, the accuracy far exceeds what you get by stuffing a context window, because the optimization happens at a lower level, directly in the model’s weights.

Step 4: Iteratively tune the LLM to stop hallucinating (Lamini Memory Tuning)

Pipeline with Lamini Memory Tuning

30-minute iterations over 1 day. Just like you iterate on your prompt for prompt-tuning, Lamini Memory Tuning is iterative too. After an iteration, you run your evaluation agent to understand where things have improved or not. Then, you debug those examples that haven’t improved by diagnosing and revising the data generation process.

Iteration on Lamini Memory Tuning

So first, the easy part. Submit your dataset of questions and supporting SQL queries to your LLM for Lamini Memory Tuning. With Lamini this is only a few lines of Python code or just one REST API call. For these 1,000 data samples and on a 7B parameter LLM (Mistral 2), this Lamini Memory Tuning job only takes 30 minutes for a single iteration.
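
For reference, a hedged sketch of that submission with the Lamini Python client is below. The exact class, method, and data-format names (Lamini, train, the input/output keys) are assumptions that may differ across SDK versions, so check the current Lamini docs before running it.

```python
# Hedged sketch: submit the generated question/SQL pairs for Memory Tuning.
# Method and argument names are assumed; consult the current Lamini SDK docs.
from lamini import Lamini

llm = Lamini(model_name="mistralai/Mistral-7B-Instruct-v0.2")  # assumed model id

pairs = [
    {
        "input": "Schema: ...\nQuestion: What was the total cost in April 2024?",
        "output": "SELECT SUM(cost) FROM cost_data WHERE fiscal_month = '042024'",
    },
    # ... roughly 1,000 generated pairs in total
]

job = llm.train(data=pairs)  # kicks off the tuning job (~30 minutes at this scale)
print(job)
```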

Now, the diagnosis part. You run your evaluation agent, and it still gets a bunch of examples wrong. Look more closely at the wrong examples: only the easiest ones passed. Now, look at your data agent pipeline: one problem was that only 30% of the generated data (and only the easy queries) was valid SQL. That means you had thrown out 70% of all generated examples, especially the ones that most resembled the hard queries. The solution? Prompt-tune your data agent to generate harder SQL queries better. You prompt-tune and find that expressing the schema differently, e.g. by deleting irrelevant columns or including an example row from the table, gets the agent to generate harder queries that are still valid.

After editing your data agent and generating more data, you run Lamini Memory Tuning again. Another 30 minutes whizzes by. This is iterative, so it can take several more rounds until you’re happy with the performance. How many depends on how good you are at diagnosing your data agents and adapting them to produce better data that covers the cases in your evaluation dataset. Good news: practice helps!

After a few iterations, you reach 94.7% accuracy on the same evaluation dataset. That’s a substantial improvement: this customer had multiple internal teams working on the same project, and the best they had achieved with advanced RAG alone was 50%.

Continuous improvement: expanded evaluation & guardrails

You’re impressed with the results and want to take things to the next level. Here’s what to do next to expand the scope and get to 99% accuracy:

Expand the evaluation dataset

Use the same adversarial playground method, this time with your Memory Tuned BI agent. Dial up the difficulty and find where it still hallucinates. That is, make the exam harder and expand your evaluation dataset with the harder examples you find. This helps you achieve two things:

  1. Create a more representative set of the queries your colleagues and business leaders will actually try on the model, and
  2. Push the LLM to learn harder tasks.

After this, follow the same steps to iteratively tune with Lamini Memory Tuning and diagnose your data agents. As you can see, this is a highly iterative process: keep going until you’re happy with the performance. Excitingly, you no longer have an accuracy ceiling, and you have much deeper control over your agents’ and LLMs’ capabilities.

Guardrails

Implement inference-time safeguards to prevent generating malicious or unsafe SQL queries. This is possible using Lamini Inference with programmatic outputs, which is similar to function calling but guarantees the right output format 100% of the time (it’s deterministic, not probabilistic!).
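
Independent of the model, a simple pre-execution check is also worth having. The sketch below is one illustrative guardrail, not Lamini’s implementation: it rejects anything that isn’t a single read-only SELECT before it reaches the SQL engine, and would sit alongside database-level permissions and the structured outputs mentioned above.

```python
# Minimal illustrative guardrail: allow only a single read-only SELECT.
import re

def is_safe_select(sql: str) -> bool:
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # multiple statements chained together
        return False
    if not re.match(r"(?is)^\s*select\b", statement):
        return False      # only SELECT statements are allowed
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|attach|pragma)\b", statement):
        return False      # no write/DDL keywords anywhere in the query
    return True

assert is_safe_select("SELECT SUM(cost) FROM cost_data WHERE fiscal_month = '042024'")
assert not is_safe_select("DROP TABLE cost_data")
```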

You should now be equipped with useful tools to start breaking through accuracy ceilings and hallucination barriers in your LLMs and agents. This walkthrough can be easily extended to other code use cases, where hallucinations on variable names and functions are deal breakers, and where you have an interpreter or execution engine to run code through. Lamini Memory Tuning even works for (messy internal) complex codebases!

Start using Lamini Memory Tuning on your own LLMs and agents

Lamini Memory Tuning can unlock previously unattainable accuracy for your enterprise LLM use cases. Contact us at info@lamini.ai to get started.

Stay tuned for additional Lamini Memory Tuning blogs, including case studies across different industries, secure on-premise deployments, and a deep dive into the technical details. More use cases, e.g. high-precision Q&A for earnings calls in financial services, high-precision product SKU recommendations in retail, and high-precision ICD-11 coding in healthcare, are shared in the LLM Photographic Memory Evaluation Suite.