Text-to-SQL: Achieving 95% accuracy

Lamini

What is text-to-SQL?

Text-to-SQL is a natural language processing (NLP) task that turns plain text into SQL queries. The goal is to empower non-technical users to access their business data without having to be SQL or database wizards. For example, a business leader may want to know “What was our revenue growth last quarter compared to the previous quarter?” To write that query by hand, the business leader would need to be proficient in SQL and have a deep understanding of the database schema. With text-to-SQL, ad hoc data requests that would’ve taken an analyst hours to respond to can now be answered in minutes. 
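Against a hypothetical schema (the table and column names below are purely illustrative), that question might translate into a query like this:

```sql
-- Hypothetical schema: orders(order_date DATE, amount DECIMAL)
-- "What was our revenue growth last quarter compared to the previous quarter?"
SELECT (q2.revenue - q1.revenue) / q1.revenue AS revenue_growth
FROM
  (SELECT SUM(amount) AS revenue FROM orders
   WHERE order_date >= '2024-01-01' AND order_date < '2024-04-01') AS q1,
  (SELECT SUM(amount) AS revenue FROM orders
   WHERE order_date >= '2024-04-01' AND order_date < '2024-07-01') AS q2;
```

Getting this right by hand requires fluency in SQL and familiarity with the schema; text-to-SQL aims to remove exactly that burden.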

Text-to-SQL challenges

Translating natural language text into syntactically and semantically correct SQL queries is challenging for many reasons, including the inherent ambiguity of natural language, complex database schemas, and advanced SQL operations. LLMs are getting better at code generation, but accuracy remains an issue because LLMs are prone to hallucinating without adequate domain context. 

Text-to-SQL approaches

Prompting and RAG

Prompting combined with Retrieval Augmented Generation (RAG) is a common approach to text-to-SQL because it’s relatively easy, cost-effective, and offers fast feedback loops. While prompting and RAG may be fine for very simple schemas and user questions, they don’t work well with the more complex schemas and data environments found in most real-world applications. 
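A minimal sketch of this approach in Python; the schema snippets and the naive keyword retrieval are stand-ins (production systems typically use embedding search over schema documentation):

```python
# Prompting + RAG for text-to-SQL: retrieve relevant schema DDL, then prompt an LLM.
# The schema and the retrieval logic below are illustrative stand-ins.

SCHEMA_DOCS = {
    "orders": "CREATE TABLE orders (order_id INT, order_date DATE, amount DECIMAL);",
    "customers": "CREATE TABLE customers (customer_id INT, name TEXT, region TEXT);",
}

def retrieve_schema(question: str) -> list[str]:
    """Naive keyword retrieval; real systems use embedding similarity search."""
    return [ddl for table, ddl in SCHEMA_DOCS.items()
            if table.rstrip("s") in question.lower()]

def build_prompt(question: str) -> str:
    """Assemble a prompt containing only the retrieved slice of the schema."""
    context = "\n".join(retrieve_schema(question))
    return (
        "You are a SQL assistant. Given the schema below, answer with a single SQL query.\n"
        f"Schema:\n{context}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("What was our total order amount last quarter?")
print(prompt)  # send this prompt to the LLM of your choice
```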

Fine-tuning

General-purpose LLMs are pretrained on a vast amount of public data but lack the domain-specific data to handle tasks that require deep knowledge of the domain. Fine-tuning lets you teach a pretrained model new knowledge and behavior using a custom dataset. There are many different methods of fine-tuning. 

  • Instruction fine-tuning is a common method that teaches a pre-trained LLM to follow instructions as a new behavior (to be “chatty”). It’s how GPT-3 became ChatGPT. 
  • Lamini Memory Tuning is a more advanced fine-tuning method that embeds new facts reliably into a pre-trained or instruction fine-tuned LLM (more below).

Overall, fine-tuning requires more expertise, resources, and time than prompting, but offers greater control and precision.
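For a sense of what instruction fine-tuning data looks like in the text-to-SQL setting, here is one illustrative pair; the field names (“instruction”, “input”, “output”) vary by framework:

```python
# An illustrative instruction-tuning pair for text-to-SQL.
# Field names and prompt wording vary across fine-tuning frameworks.
instruction_pairs = [
    {
        "instruction": "Write a SQL query that answers the question using the given schema.",
        "input": "Schema: CREATE TABLE orders (order_date DATE, amount DECIMAL);\n"
                 "Question: What was total revenue in March 2024?",
        "output": "SELECT SUM(amount) FROM orders "
                  "WHERE order_date >= '2024-03-01' AND order_date < '2024-04-01';",
    },
]
```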

Lamini Memory Tuning

Lamini Memory Tuning is the most advanced and accurate method of fine-tuning available today. It embeds proprietary data into open-source LLMs in a way that improves factual accuracy and reduces hallucinations (one customer went from 50% accuracy with other approaches to 95%, with hallucinations reduced from 50% to 5%). Inspired by information retrieval, the tuned model retrieves only the most relevant experts from an index at inference time, rather than activating all of the model weights, so latency and cost are dramatically lower. 

How Lamini Memory Tuning works

  1. Install Lamini
  2. Create a baseline SQL LLM
  3. Create an Evaluation Dataset with <100 data points
  4. Evaluate the SQL LLM with the Evaluation Dataset
  5. Generate tuning data with Data LLMs
  6. Tune model with Lamini Memory Tuning
  7. Iterate and improve your tuned model!
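In code, the loop looks roughly like the sketch below. It assumes the lamini Python client; the class and method names (Lamini, generate, tune) reflect the public client at the time of writing, so check Lamini’s docs for current signatures, and the model name and data shapes are illustrative.

```python
# pip install lamini   (step 1)
from lamini import Lamini  # assumes the lamini Python client is installed

# Step 2: create a baseline SQL LLM from an open-source instruction-tuned model.
llm = Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")  # illustrative model

# Steps 3-4: evaluate against a small Evaluation (Gold) Dataset of <100 pairs.
gold = [{"question": "How many customers are in each region?",
         "gold_sql": "SELECT region, COUNT(*) FROM customers GROUP BY region;"}]
predictions = [llm.generate(f"Question: {ex['question']}\nSQL:") for ex in gold]
# Score predictions against gold_sql -- see the accuracy FAQ below for one approach.

# Steps 5-6: generate tuning data with Data LLMs, then run Lamini Memory Tuning.
# The tune() call is an assumption about the current client API; check the docs.
tuning_data = [{"input": ex["question"], "output": ex["gold_sql"]} for ex in gold]
llm.tune(data_or_dataset_id=tuning_data)

# Step 7: re-run the evaluation, inspect failures, expand the dataset, and repeat.
```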

How a Fortune 500 company achieves 94.7% Text-to-SQL accuracy

“Fine-tuning doesn’t have a lot of docs or best practices. It’s a lot of trial and error, so it takes weeks to finetune a model. With Lamini, I was shocked — it was 2 hours.”

— Fortune 500 engineering leader

FAQ

What do you mean by accuracy?

The best practice is to create a dataset with the expected inputs to the LLM and the correct corresponding outputs — this is what we refer to as the Gold Dataset. For the text-to-SQL use case, accuracy is measured as the percentage of generated SQL queries that are correct when evaluated against the Gold Dataset, where correctness means valid SQL that correctly answers the user’s question.
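One common way to check correctness is to execute both the generated query and the gold query against a reference database and compare result sets. A self-contained sketch using SQLite (the toy schema, data, and queries are illustrative):

```python
import sqlite3

def results_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """Correct = valid SQL AND same result set as the gold query."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as incorrect
    gold = conn.execute(gold_sql).fetchall()
    # Sort rows so correctness is not sensitive to row order.
    return sorted(pred) == sorted(gold)

# Toy reference database (illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-03-05", 100.0), ("2024-04-02", 150.0)])

gold_dataset = [
    {"question": "What was total revenue in March 2024?",
     "gold_sql": "SELECT SUM(amount) FROM orders WHERE order_date LIKE '2024-03%'"},
]
predicted = ["SELECT SUM(amount) FROM orders "
             "WHERE order_date >= '2024-03-01' AND order_date < '2024-04-01'"]

correct = sum(results_match(conn, p, ex["gold_sql"])
              for p, ex in zip(predicted, gold_dataset))
print(f"accuracy = {correct / len(gold_dataset):.0%}")
```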

What is a Gold Dataset?

A Gold Dataset is a reliable set of inputs and their correct expected outputs, which is used to benchmark improvements and offer a north star towards which the LLM should improve via tuning. 

One misconception is that the Gold Dataset needs to cover everything upfront to be useful. The best practice is to start small, with around 20 input-output pairs, and grow from there (a sample entry is shown after the list below). In early iterations, it is more important to prioritize quality over quantity. After several tuning iterations, you will increase the size and complexity of this dataset. The Gold Dataset should:

  • Be small enough to quickly validate accuracy.
  • Include relevant examples that you want the LLM to improve and not regress on.
  • Include the easiest examples that still fail. For example, you should expect <50% accuracy on the first run, then tune up to your target accuracy (e.g. 90%) before expanding to more challenging examples.
  • Expand after successful iterations to cover a greater breadth of examples and harder ones, i.e. the new easiest examples that still fail in the latest “best” iteration.
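Concretely, a starter Gold Dataset can be as simple as a file of question/SQL pairs. A minimal illustrative example (the schema, questions, and file name are hypothetical):

```python
import json

# A starter Gold Dataset: ~20 question -> SQL pairs, reviewed by a schema expert.
gold_dataset = [
    {"question": "How many customers do we have in each region?",
     "gold_sql": "SELECT region, COUNT(*) FROM customers GROUP BY region;"},
    {"question": "What was total revenue in Q1 2024?",
     "gold_sql": "SELECT SUM(amount) FROM orders "
                 "WHERE order_date >= '2024-01-01' AND order_date < '2024-04-01';"},
]

# One JSON object per line (JSONL) keeps the dataset easy to diff and extend.
with open("gold_dataset.jsonl", "w") as f:
    for example in gold_dataset:
        f.write(json.dumps(example) + "\n")
```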

How much data do I need to get started?

For the text-to-SQL use case, you can start with your schema and generate user questions (input) and corresponding SQL queries (output) from it — or take advantage of any SQL query logs you might have directly. For the Gold Dataset, you will need about 20-40 input-output pairs to start.
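Here is a sketch of the schema-driven generation step: prompt a strong “Data LLM” with your DDL and ask for candidate pairs. The prompt wording and schema are illustrative, and every generated query should be validated by execution and reviewed by a subject matter expert before it enters the Gold Dataset:

```python
# Generate candidate (question, SQL) pairs from the schema alone.
SCHEMA_DDL = """
CREATE TABLE orders (order_id INT, order_date DATE, amount DECIMAL);
CREATE TABLE customers (customer_id INT, name TEXT, region TEXT);
"""

generation_prompt = (
    "You are generating training data for a text-to-SQL model.\n"
    f"Schema:\n{SCHEMA_DDL}\n"
    "Write 5 realistic business questions about this schema, each followed by "
    "a correct SQL query, as JSON objects with 'question' and 'sql' fields."
)
# Send generation_prompt to a strong LLM (a "Data LLM"), parse the JSON output,
# then validate each candidate query by executing it against a reference database.
print(generation_prompt)
```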

What resources do I need in place to get started?

You will need: 

  • GPU capacity - at least 1 GPU to get started!
  • A software engineer or data scientist who understands how to build data pipelines and is comfortable working with LLM APIs
  • A Gold Dataset — and input from a subject matter expert to make sure the SQL queries are correct for that schema on those 20 or so examples

Why can’t we just use RAG or instruction fine-tuning? 

RAG and instruction fine-tuning are both acceptable methods for use cases that don’t require a high degree of precision. RAG relies on similarity search across information that doesn’t fit into the prompt and selectively chooses passages to read, which makes it unreliable for high-precision tasks. Instruction fine-tuning can’t ensure that the model’s answers are faithful to facts in its training data, but it does get the model to be chatty. 

Isn’t fine-tuning hard and expensive, and doesn’t it need a ton of labeled data and manual effort?

  • With LLMs available to transform and generate data, fine-tuning has become much more accessible. For example, with the schema alone, you can reach high accuracy on text-to-SQL. Memory tuning makes it possible for the LLM to operate reliably on custom schemas by reducing hallucinations, which cuts both the time to reach target accuracy and the developer effort required.
  • There is some upfront setup required to be successful, but it will pay dividends once you start the iteration cycle and tune your model at higher degrees of complexity. We have a lot of code examples and docs to help you get started. 

Who has this worked for? 

Here’s a case study from a Fortune 500 customer who was able to achieve 94.7% LLM accuracy with their Business Intelligence agent for their text-to-SQL use case.