Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations

TLDR:

  • Lamini Memory Tuning is a new way to embed facts into LLMs that improves factual accuracy and reduces hallucinations to previously unachievable levels — for one Fortune 500 customer, Lamini Memory Tuning led to 95% accuracy compared to 50% with other approaches. Hallucinations were reduced from 50% to 5%.
  • Lamini Memory Tuning is a research breakthrough that overcomes a seeming paradox in the AI world: achieving precise factual accuracy (i.e. no hallucinations) while upholding the generalization capabilities that make LLMs valuable in the first place.
  • The method entails tuning millions of expert adapters (e.g. LoRAs) with precise facts on top of any open-source LLM, like Llama 3 or Mistral. If the goal is to get Roman Empire facts exactly right, Lamini Memory Tuning would create experts on Caesar, aqueducts, legions, and any other facts you provide. Inspired by information retrieval, the model retrieves only the most relevant experts from an index at inference time, not all the model weights, so latency and cost are dramatically lower. High accuracy, high speed, low cost: with Lamini Memory Tuning, you don’t have to choose.
  • Contact us to try Lamini Memory Tuning.

Accuracy matters immensely

For the use cases that matter most, a single wrong fact can make an answer worthless. Yet general-purpose LLMs are designed to hallucinate, because they are trained to reduce the average error across the examples they’ve seen. They’re pretty good at everything, but perfect at nothing. They can produce fluent English prose because they’ve seen so much of it across the internet, but specific facts, like a date, a revenue number, or a variable name, get muddled in the probabilities. As a result, companies have not been able to count on LLMs for their most critical and most valuable use cases. Until now.

Introducing Lamini Memory Tuning
Lamini Memory Tuning is a completely new way to fine-tune any existing LLM by tuning millions of LoRA adapters and selecting across them in a wide Mixture of Experts at inference time.

Instead of optimizing average error on everything, Lamini Memory Tuning optimizes for zero error on the specific facts you tell it to remember, so it recalls those facts nearly perfectly. Memorization alone is nothing new. What makes this approach groundbreaking is that it preserves the LLM’s ability to generalize, with average error on everything else, so the model can still produce fluent prose around those facts. Lamini Memory Tuning is a systematic tool for eliminating hallucinations on the facts you care about.
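
To make that objective concrete, here is a minimal sketch of what “optimizing for zero error on a specific fact” can look like with a standard Hugging Face causal LM. This is an illustration of the idea, not Lamini’s implementation: it fine-tunes the full weights on a single prompt/fact pair (the column name is made up), whereas Lamini Memory Tuning applies this kind of objective across LoRA-style memory experts.

```python
# Illustrative only: drive the loss on the fact tokens toward zero,
# instead of minimizing average error over a broad dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # any open-source LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the exact name of the Q3 revenue column?\nA: "
fact = "fiscal_q3_net_revenue_usd"  # hypothetical fact to recall verbatim

prompt_ids = tok(prompt, return_tensors="pt").input_ids
fact_ids = tok(fact, add_special_tokens=False, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, fact_ids], dim=1)

# Compute loss only on the fact tokens: the prompt positions are masked out.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for step in range(200):
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if loss.item() < 1e-3:  # stop once the fact is recalled (near-)perfectly
        break
```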

Fortune 500 customers are already using Lamini Memory Tuning to achieve 95% factual accuracy on critical use cases where previous state-of-the-art approaches peaked at 50%.

THE PROBLEM

Prompting and RAG: necessary but not sufficient

Prompting and Retrieval Augmented Generation (RAG) are important methods for surfacing relevant information to the model, shifting its output probabilities toward that information. This is an important step in getting the model to condition on the right concepts and facts, because the model has been trained on so many different tasks. Good prompt-engineering and RAG pipelines are critical for improving the overall accuracy of the model.
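
As a reference point, a typical RAG step looks something like the sketch below, using sentence-transformers for retrieval. The schema snippets, model choice, and names are illustrative assumptions, not a specific customer setup.

```python
# A minimal RAG-style retrieval step: fetch relevant schema text and prepend it.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
schema_docs = [
    "Table orders(order_id, customer_id, fiscal_q3_net_revenue_usd, created_at)",
    "Table customers(customer_id, region, segment)",
]
question = "What was Q3 net revenue by region?"

doc_emb = embedder.encode(schema_docs, convert_to_tensor=True)
q_emb = embedder.encode(question, convert_to_tensor=True)
top = util.cos_sim(q_emb, doc_emb)[0].topk(k=2)

context = "\n".join(schema_docs[int(i)] for i in top.indices)
prompt = f"Schema:\n{context}\n\nQuestion: {question}\nSQL:"
# The LLM now conditions on the right schema, but as discussed below it can
# still emit a near-miss column name.
```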

At times, this is all you need. But at other times, you provide the relevant information and the response is still wrong: close enough to look right, yet still a hallucination.

Why do hallucinations happen with the right data? In the model’s internal representation, the right answer is likely clustered with similar, but wrong, options. The right context increases the probabilities of the right answer and nearby wrong options. The model doesn’t know that a nearly right answer is still wrong, because general models don’t distinguish between exactly right and nearly right — they never learned to take the loss on those answers to zero. Prompting and RAG don’t change that.
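
A toy example of that failure mode (the numbers are made up for illustration): the retrieved context concentrates probability on a cluster of near-identical answers, and the exactly-right one is only modestly more likely than its wrong neighbors.

```python
# Hypothetical probabilities for a column name after RAG has supplied the
# correct schema. The right answer leads, but not by much.
candidates = {
    "fiscal_q3_net_revenue_usd": 0.34,  # exactly right
    "fiscal_q3_net_revenue": 0.31,      # nearly right, still breaks the query
    "q3_net_revenue_usd": 0.22,
    "net_revenue": 0.13,
}
best = max(candidates, key=candidates.get)
print(best, candidates[best])
# Greedy decoding happens to pick the right name here, but with sampling or a
# slightly different prompt the model frequently lands on a near-miss variant,
# because it was never trained to treat "nearly right" as wrong.
```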

Lamini Memory Tuning addresses this directly, by combining methods from information retrieval and AI to teach the model that getting the answer nearly right is the same as getting it totally wrong.

Instruction fine-tuning: the wrong tool for the job

Many teams turn to instruction fine-tuning when other techniques hit a wall on factual accuracy. But instruction fine-tuning, with or without LoRAs, runs into the same issue as pre-training: the model becomes pretty good on a narrower dataset, yet is still perfect at nothing, and it is finicky to work with (done wrong, it can lose the ability to perform some general tasks).

As a result, teams struggle with unclear choices, long feedback loops, high compute bills, and ultimately underwhelming performance improvements. While instruction fine-tuning can be really valuable (it’s what turned GPT-3 into ChatGPT), it doesn't make models perfect at the facts that matter. In other words, traditional fine-tuning does not ensure that the model's answers are faithful to facts in its training data.

This is why we developed Lamini Memory Tuning.

OUR INNOVATION

Lamini Memory Tuning: near-perfect fact recall via a 1-million-way MoE

Lamini Memory Tuning is a fundamentally different fine-tuning approach that effectively teaches any open-source LLM to be near-perfect on facts, while still maintaining its ability to be pretty good at everything else. When the model is supposed to recall a specific fact, Lamini Memory Tuning shifts the entire probability mass to that particular fact (i.e. specific tokens within a particular context), such as the exact SQL schema for your database. This results in output probabilities that are not just closer to the right result, but exactly there.

To do this, Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM. Each memory expert acts like a LoRA adapter that functionally operates as memory for the model. Together, the memory experts specialize in a million different ways, keeping the model faithful and factually accurate to the data it was tuned on. Inspired by information retrieval, these million memory experts act like an index from which the model intelligently retrieves and routes. At inference time, the model retrieves the most relevant experts at each layer and merges them back into the base model to respond to the user query.
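
The research paper has the full details; the sketch below is only a rough, hedged illustration of the routing idea, not Lamini’s architecture or code. It shows one linear layer augmented with a bank of small LoRA-style memory experts, of which only the top-k retrieved experts are merged per query. The expert count is kept small here so the example runs; the post describes scaling it to roughly a million.

```python
# Rough sketch of per-layer expert retrieval and merging (illustrative, not Lamini's code).
import torch
import torch.nn.functional as F

class MemoryExpertLayer(torch.nn.Module):
    def __init__(self, d_model=256, rank=8, num_experts=1000, top_k=4):
        super().__init__()
        self.base = torch.nn.Linear(d_model, d_model, bias=False)       # frozen base weights
        self.register_buffer("keys", torch.randn(num_experts, d_model))  # retrieval index
        self.lora_A = torch.nn.Parameter(torch.randn(num_experts, rank, d_model) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(num_experts, d_model, rank))
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, d_model)
        scores = x @ self.keys.T               # score every memory expert
        top = scores.topk(self.top_k, dim=-1)  # retrieve only the most relevant ones
        gate = F.softmax(top.values, dim=-1)
        out = self.base(x)
        for i in range(self.top_k):            # merge the selected low-rank updates
            A = self.lora_A[top.indices[:, i]]  # (batch, rank, d_model)
            B = self.lora_B[top.indices[:, i]]  # (batch, d_model, rank)
            delta = torch.bmm(B, torch.bmm(A, x.unsqueeze(-1))).squeeze(-1)
            out = out + gate[:, i : i + 1] * delta
        return out

layer = MemoryExpertLayer()
print(layer(torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```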

The result is a sparsely activated model, called a Mixture of Memory Experts (MoME), that can scale to an enormous number of parameters at a fixed computational inference cost. This means MoMEs have extremely high capacity for the number of facts that can be learned, bounded only by the total size of the training data set. Llama 3 was trained on 15 trillion tokens. Realistically, you will run out of system memory before you run out of memory capacity in a MoME.
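
Some back-of-the-envelope arithmetic shows why capacity scales this way. The adapter rank, hidden size, and layer count below are assumptions for illustration, not numbers from the paper.

```python
# Illustrative capacity math: total adapter parameters grow linearly with the
# number of memory experts, while per-query compute stays fixed because only
# a few experts are retrieved per layer.
rank, d_model = 8, 4096                            # assumed LoRA rank and hidden size
params_per_expert_per_layer = 2 * rank * d_model   # one A and one B matrix
num_experts, adapted_layers = 1_000_000, 32        # assumed scale
total = params_per_expert_per_layer * num_experts * adapted_layers
print(f"~{total / 1e12:.1f} trillion adapter parameters available as memory")
```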

Ultimately, this approach puts use cases that were previously impossible because of hallucinations within reach, and drastically improves LLM time-to-accuracy and thus time-to-market.

Read more details in our research paper.

Results

Lamini Memory Tuning has been a game-changing capability for Lamini’s Fortune 500 clients, who are deploying it for the following use cases:

High precision text-to-SQL
  • Client need: Democratize data access by using LLMs to turn natural language questions into database queries.
  • Challenge: The relevant databases had unique internal names and large, messy schemas.
  • Result: We achieved 95% accuracy with Lamini Memory Tuning, compared to 50% accuracy with RAG alone.
High precision classification
  • Client need: Save thousands of hours by automatically labeling data accurately.
  • Challenge: We had to adhere to an exact taxonomy of 900 categories.
  • Result: We achieved 100% accuracy across thousands of documents.
High precision recommendations
  • Client need: Increase cart size and revenue with AI-powered product suggestions.
  • Challenge: Applications break when product IDs are hallucinated.
  • Result: We achieved 88% accuracy across a 50,000-product database.

A new frontier

Lamini Memory Tuning changes several of the fundamental dynamics and tradeoffs governing how we work with LLMs. We’re in the early days of this new paradigm, and we’re still learning alongside our customers what’s possible. Summarizing a few areas we’re most excited about:

  • Higher accuracy enables full automation as opposed to copiloting.
  • Lower costs let you take your product from internal demos to a wider production audience.
  • Lower latency enables seamless user experiences.
  • Smaller models mean faster development and improvement cycles.

What could you do with models that ran faster, were more accurate, and cost less to develop and run?

Start using Lamini Memory Tuning

Because Lamini Memory Tuning is a cutting-edge technique that embeds your unique data in a new model architecture, we’re exclusively working with select partners.

Contact us to try Lamini Memory Tuning.

Want to learn more?

  • Read the case study to see how a Fortune 500 company is using Lamini Memory Tuning for a 95% accurate text-to-SQL agent.