Memory RAG: High-accuracy mini-agents with embed-time compute
Today, we're excited to announce the beta release of Memory RAG, a new approach to Retrieval-Augmented Generation (RAG) that achieves 91-95% accuracy across diverse enterprise use cases, compared to 20-59% for traditional RAG implementations. By investing compute during the embedding generation phase (embed-time compute), Memory RAG creates more intelligent, validated data representations that dramatically improve information retrieval and reduce model hallucinations.
Instead of relying on a large LLM to figure everything out at inference time, Memory RAG lets you use a smaller model, a mini-agent, that works with pre-validated, well-structured information. For a deep dive into the technology, read our white paper: Memory RAG: Simple High-Accuracy LLMs using Embed-Time Compute.
The Challenge with Traditional RAG
Large Language Models (LLMs) have become essential for modern enterprise applications, but achieving high accuracy while maintaining simplicity remains a significant challenge. Basic RAG systems face a fundamental limitation: they must balance comprehensive data coverage against context window constraints. To ensure they catch all relevant information, they often pack large amounts of content into the context window, which degrades accuracy as the LLM becomes overwhelmed with information.
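To make that tradeoff concrete, here is a minimal sketch of a generic retrieval loop. Everything in it is an illustrative assumption (the stand-in embedding function, the chunk count, the top-k value), not a detail of any particular product:

# A minimal sketch of a traditional RAG loop; all names here are illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

chunks = [f"passage {i} ..." for i in range(1000)]  # raw document chunks
index = np.stack([embed(c) for c in chunks])        # basic vector index

def retrieve(query: str, top_k: int = 50) -> list[str]:
    scores = index @ embed(query)                   # cosine similarity
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

# A generous top_k keeps recall high but floods the context window,
# which is exactly the accuracy-degrading overload described above.
prompt = "\n".join(retrieve("What is the warranty period?"))

Raising top_k protects recall, but every extra passage inflates the prompt the LLM must reason over.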
Our experimental results highlight this challenge. In a comprehensive evaluation of factual reasoning on financial documents using GPT-4 with OpenAI's built-in RAG, accuracy reached only 59% despite using state-of-the-art models. Even more striking, in complex database querying scenarios for a Fortune 500 enterprise, traditional RAG over query logs, even with a prompt describing the schema in detail, achieved only 20% accuracy with GPT-4.
Introducing Memory RAG: A New Approach
Memory RAG takes a fundamentally different approach by focusing computational power where it matters most: the embedding generation phase. This shift lets small language models specialize into high-accuracy mini-agents, because they work with pre-validated, well-structured embeddings rather than having to process complex raw data.
Instead of just converting text into basic numerical representations, Memory RAG uses embed-time compute to process and enhance the data during embedding creation (see the sketch after this list). This enhanced processing:
- Identifies and preserves important relationships between different pieces of information
- Validates data accuracy and reliability through proprietary checks, including specialized Memory Tuned models for validation
- Creates optimized embeddings that require less context to convey meaning
- Enables more precise matching between queries and relevant information
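The actual processing is proprietary, but the shape of the idea can be sketched. The following is a conceptual sketch only, assuming a per-chunk distill-and-validate pass; the helper logic is a simple stand-in, not Lamini's actual method:

# Conceptual sketch only: the real distillation and validation steps are
# proprietary (validation uses specialized Memory Tuned models, per above).
# These helpers are simple stand-ins that show the shape of the pipeline.
from dataclasses import dataclass

@dataclass
class FactRecord:
    fact: str    # a distilled, self-contained statement
    source: str  # the passage it was extracted from

def extract_facts(chunk: str) -> list[str]:
    # Stand-in: a real system might run an LLM pass per chunk to distill
    # atomic, self-contained statements.
    return [s.strip() for s in chunk.split(".") if s.strip()]

def validate_fact(fact: str, source: str) -> bool:
    # Stand-in grounding check: keep only facts supported by the source text.
    return fact in source

def embed_time_index(chunks: list[str]) -> list[FactRecord]:
    # Spend compute per chunk at indexing time; store only validated facts.
    # The stored units are small, so query-time context stays small too.
    return [
        FactRecord(fact, chunk)
        for chunk in chunks
        for fact in extract_facts(chunk)
        if validate_fact(fact, chunk)
    ]

Because each stored unit is a compact, validated statement rather than a raw passage, a query needs to retrieve far less text to answer correctly.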
By investing compute during embedding creation rather than retrieval, Memory RAG achieves three crucial benefits:
- Higher accuracy through better data representation
- Faster inference through smaller, more targeted context windows
- The ability to turn small language models (SLMs) into highly effective mini-agents for specific tasks, as they can leverage these optimized representations without needing massive model sizes
Dramatic Improvements in Accuracy
Our experimental results demonstrate significant improvements over traditional RAG implementations:
Financial Document Analysis
In a comprehensive evaluation using Wells Fargo earnings calls (2020-2024):
- Memory RAG achieved 91% accuracy with an average response time of 5.76 seconds
- Traditional RAG with GPT-4 achieved only 59% accuracy
Database Query Understanding
In tests involving complex database schemas using real production data from a Fortune 500 company:
- Memory RAG achieved 95% accuracy with an average response time of 1.13 seconds
- Traditional RAG with GPT-4 achieved only 20% accuracy, even with enhanced SQL query prompts
Simple Implementation, Powerful Results
Despite its sophisticated underlying technology, Memory RAG is designed for simplicity. It's exposed through a straightforward API that follows familiar RAG patterns:
# Import path and client setup are assumed; see the SDK docs for specifics.
from memory_rag import MemoryRAG

memory_rag = MemoryRAG()
memory_rag.memory_index(documents=docs)  # docs: your list of source documents
response = memory_rag.query("What is the warranty period?")
The system handles all the complexity automatically:
- Automated knowledge distillation
- Optimized embedding creation with embed-time compute
A Bridge to Memory Tuning
Memory RAG is designed to provide a natural progression path to Memory Tuning, our fine-tuning solution. Organizations can start with Memory RAG for immediate accuracy improvements, then transition to Memory Tuning when ready for even greater control and accuracy, without rebuilding their data pipeline. To learn more about Memory Tuning, read our white paper: Enterprise Guide to Fine-Tuning.
The Future of Specialized Mini-Agents
Looking ahead, Memory RAG opens up exciting possibilities for transforming small language models into highly specialized mini-agents - and deploying them at scale. Because Memory RAG handles the heavy lifting during the embedding phase, these smaller models can be effectively trained to become focused mini-agents that achieve surprisingly high accuracy on specific tasks.
The real power comes from being able to deploy massive numbers of these specialized agents in parallel, each handling different aspects of complex workflows. Imagine dozens of mini-agents working simultaneously on tasks like ticket classification, knowledge base lookup, and response generation - all operating efficiently on the same high-quality embeddings. As applications grow more complex, this ability to deploy many specialized agents in parallel will become increasingly valuable for building sophisticated AI systems.
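As a rough illustration, here is a minimal sketch of that fan-out pattern, assuming the indexed MemoryRAG client from the quick-start example above; the agent tasks and prompts are hypothetical:

# Illustrative sketch: fanning out hypothetical mini-agents over one shared
# index. memory_rag is the indexed client from the quick-start example above.
from concurrent.futures import ThreadPoolExecutor

ticket = "My router stopped working after the latest firmware update."

def classify_ticket(t: str) -> str:
    return memory_rag.query(f"Classify this support ticket: {t}")

def lookup_kb(t: str) -> str:
    return memory_rag.query(f"Find knowledge base articles relevant to: {t}")

def draft_reply(t: str) -> str:
    return memory_rag.query(f"Draft a first response to: {t}")

agents = [classify_ticket, lookup_kb, draft_reply]

# Each mini-agent runs in parallel against the same pre-validated embeddings.
with ThreadPoolExecutor(max_workers=len(agents)) as pool:
    classification, articles, reply = pool.map(lambda a: a(ticket), agents)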
Get Started Today
We're excited to offer Memory RAG in beta. Try it today with $300 in free credits and experience the next generation of RAG technology.
Resources
- Sign up for $300 in credits and try Memory RAG for free
- Download the white paper: "Memory RAG: Simple High-Accuracy LLMs using Embed-Time Compute"
- Learn more about Memory RAG