Lamini LLM Photographic Memory Evaluation Suite

Lamini

TL;DR

  • A tiny error from LLMs in critical domains like healthcare or finance could have catastrophic consequences.
  • Even after rigorous prompt-tuning, RAG, and fine-tuning, LLMs may still make critical precision errors, such as getting a single letter of a medical code wrong or miscalculating an investment value.
  • Lamini introduces the LLM Photographic Memory Evaluation Suite, which quantifies LLM performance on tasks requiring photographic memory for dependable and precise model evaluation.
  • Join us to build truly Photographic Memory for LLMs! https://jobs.lever.co/laminiai.

You're building LLMs for critical domains like healthcare or finance. In healthcare, getting a single letter of a medical code wrong could lead to disastrous results. Similarly, misunderstanding a key financial term or miscalculating an investment value could cause chaos.

Even after rigorous prompt-tuning, retrieval-augmented generation (RAG), and fine-tuning, LLMs commonly struggle to accurately remember and reproduce numbers, figures, and precise calculations, and the problem only gets worse for data-intensive use cases where extremely high precision is essential.

Lamini Evaluation Suite: Path to Photographic Memory

Today, Lamini is introducing a new evaluation benchmark suite that quantifies LLM performance on tasks requiring photographic memory for dependable and precise model evaluation.

The suite includes benchmarks that test model precision and recall on domain-specific data, such as finance, e-commerce, and medicine. We call this “Photographic memory” because the tasks require an exact match, and these are the kinds of high-precision tasks that enterprises typically work on. The benchmarks can easily be adapted to a specific enterprise use case with private data.
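To make the idea concrete, here is a minimal sketch of what an exact-match, photographic-memory-style metric looks like. The function names and whitespace normalization are illustrative assumptions, not Lamini's internal scoring code.

def exact_match(generated: str, reference: str) -> bool:
    # "Photographic memory": the answer must reproduce the reference verbatim,
    # modulo surrounding whitespace.
    return generated.strip() == reference.strip()

def photographic_memory_score(predictions: list[str], references: list[str]) -> float:
    # Fraction of answers that match the labeled reference exactly.
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

# A single wrong character (e.g. in a medical code) counts as a complete miss.
print(photographic_memory_score(["1A00", "1B20"], ["1A00", "1B21"]))  # 0.5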

The suite also incorporates well-known open-source benchmarks such as MMLU, TruthfulQA, and others to compare the model's performance against the base model. This helps assess whether the knowledge acquired during pre-training is retained after fine-tuning.

Before we get into the details of the Lamini evaluation suite, let’s look at what evaluation for an LLM looks like.

What are some metrics for LLM evaluation?

There are many standard benchmarks for evaluating LLM outputs. Each serves a different purpose and targets different abilities of LLMs for evaluation.

MMLU (Massive Multitask Language Understanding)
A benchmark for knowledge-intensive question answering that measures a model's multitask accuracy across 57 subjects.

TruthfulQA
A benchmark that measures whether a language model is truthful in generating answers to questions. It comprises 817 questions across 38 categories, such as health, law, finance, and politics.

WinoGrande
A large-scale benchmark for commonsense reasoning, built on pronoun resolution problems in the style of the original expert-crafted Winograd Schema Challenge.

HellaSwag
A benchmark for commonsense natural language inference.

And many more! Lamini's Evaluation Suite uses many of the above metrics for a holistic evaluation.

A Deep Dive into the Lamini Evaluation Suite

Lamini boasts many enterprise deployments spanning diverse domains such as finance, law, healthcare, engineering, and more. The tasks curated for the evaluation suite were carefully selected to mirror genuine enterprise use cases. This selection reflects our experienced understanding of what enterprises truly prioritize, and we have developed a process tailored to meet those needs.

Standard Benchmarks

MMLU (Global facts): We use a sample domain (global facts) sourced from the well-known MMLU benchmark. This assesses the precision of predicted outputs for queries about global facts. The objective is to evaluate whether the model retains the information learned during pre-training and whether the fine-tuned model's performance regresses compared to the base model. These are multiple-choice questions about well-known facts, and the model should generate an output option that exactly matches the answer.

TruthfulQA: This is another popular benchmark for measuring whether a language model is truthful in generating answers to questions. The benchmark tests whether the model generates false answers that it may have learned from imitating human texts. It uses BLEU and ROUGE scores to rate a model output.
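For free-form answers, BLEU/ROUGE-style scoring can be illustrated with the Hugging Face evaluate library. This is a hedged stand-in for the benchmark's own scoring code; TruthfulQA's official pipeline differs in its details.

import evaluate  # Hugging Face `evaluate` library (pip install evaluate)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

prediction = ["Cracking your knuckles does not cause arthritis."]
reference = ["No, cracking your knuckles does not cause arthritis."]

# BLEU expects one list of reference strings per prediction; ROUGE accepts a
# flat list of references.
print(bleu.compute(predictions=prediction, references=[reference]))
print(rouge.compute(predictions=prediction, references=reference))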

Domain-specific Benchmarks

E-commerce domain:

Use case: given a catalog of products and product information, evaluate how well a model can answer questions about a product and how accurately it can recall all of its details.

Product ID precision score: This benchmark task compares the generated product ID with the exact product ID from the labeled evaluation dataset.
Product Response Subjective score: This evaluates the overall quality of the answer. It compares the generated product information with the overall product information in the dataset.
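Below is a minimal sketch of how a Product ID precision score can be computed. It assumes numeric catalog IDs; the regex and extraction step are illustrative, not the benchmark's actual logic.

import re

ID_PATTERN = re.compile(r"\b\d{5,}\b")  # assumes numeric catalog product IDs

def product_id_precision(generated_answer: str, gold_product_id: str) -> int:
    # Pull the first product-ID-looking token out of the model's answer and
    # require an exact match with the labeled product ID.
    match = ID_PATTERN.search(generated_answer)
    predicted_id = match.group(0) if match else ""
    return int(predicted_id == gold_product_id)

print(product_id_precision("That would be product 123456.", "123456"))  # 1
print(product_id_precision("That would be product 123457.", "123456"))  # 0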

Medical domain:

Use case: answer questions about the ICD-11 standard. ICD-11 (International Classification of Diseases 11th Revision) is a globally used standard for classifying diseases, health conditions, and related phenomena. It is maintained by the World Health Organization (WHO) and serves as a common language for health information systems, epidemiology, health statistics, clinical care, and research.

ICD Code Precision Score: When a model answers a question about the ICD-11 standard, this metric evaluates the accuracy of the generated ICD-11 code. The standard is stringent, so producing the exact code is important.
ICD Code Subjective Score: This metric evaluates the overall quality of the answers regarding correctness and completeness and assigns a score.
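A hedged sketch of the ICD Code Precision Score follows. The regex only approximates ICD-11 code shapes (e.g. 1A00, 8A61.1) and stands in for whatever extraction the real benchmark performs.

import re

# Simplified approximation of ICD-11 code shapes such as "1A00" or "8A61.1".
ICD11_CODE = re.compile(r"\b[0-9A-Z][A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,2})?\b")

def icd_code_precision(generated_answer: str, gold_code: str) -> int:
    # Extract the first ICD-11-style code from the answer and compare it
    # exactly to the gold code.
    match = ICD11_CODE.search(generated_answer)
    return int(match is not None and match.group(0) == gold_code)

print(icd_code_precision("Cholera is coded as 1A00.", "1A00"))  # 1
print(icd_code_precision("Cholera is coded as 1A01.", "1A00"))  # 0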

Finance domain:

Use case: answer questions about companies' financial performance based on the transcripts of their earnings calls.

Earnings Value Precision Score: When answering a question about a company's financials, the model might output a value like $800M. This score evaluates the accuracy of both the value and the units generated.
Earnings Value Subjective Score: This metric assigns a score to the overall quality of the answers in terms of correctness and coherence.
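Here is one way an Earnings Value Precision Score might normalize and compare money figures. The unit handling and tolerance are assumptions made for illustration, not the benchmark's actual implementation.

import math
import re

UNIT_IN_MILLIONS = {"K": 1e-3, "M": 1.0, "B": 1e3}  # thousands, millions, billions
MONEY = re.compile(r"\$?\s*([0-9]+(?:\.[0-9]+)?)\s*([KMB])\b", re.IGNORECASE)

def value_in_millions(text: str) -> float | None:
    # Normalize figures such as "$800M", "800M $", or "$0.8B" to millions.
    match = MONEY.search(text)
    if not match:
        return None
    return float(match.group(1)) * UNIT_IN_MILLIONS[match.group(2).upper()]

def earnings_value_precision(generated: str, gold: str) -> int:
    predicted, expected = value_in_millions(generated), value_in_millions(gold)
    return int(predicted is not None and expected is not None
               and math.isclose(predicted, expected))

print(earnings_value_precision("Revenue was $0.8B last quarter.", "800M $"))  # 1
print(earnings_value_precision("Revenue was $790M last quarter.", "800M $"))  # 0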

Dataset Preparation

For standard benchmarks like MMLU, we evaluate them on open-source datasets. We use Eleuther AI’s LM Evaluation Harness to download the datasets and run the evaluations for standard benchmarks. All the evaluation requests run on Lamini servers and produce the desired JSON outputs, which are compared against the golden answers provided.
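As a hedged example, a standard-benchmark run with the LM Evaluation Harness (v0.4+) can look like the sketch below. The model name is a placeholder, and task identifiers (e.g. mmlu_global_facts, truthfulqa_mc2) vary between harness versions, so check the harness's task list for yours.

import lm_eval  # EleutherAI lm-evaluation-harness, v0.4+

results = lm_eval.simple_evaluate(
    model="hf",                                 # Hugging Face backend
    model_args="pretrained=<your-model-name>",  # placeholder model identifier
    tasks=["mmlu_global_facts", "truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"])  # per-task metrics (accuracy, BLEU/ROUGE, ...)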

For custom benchmarks, we prepared Q/A pairs for instruction fine-tuning and evaluation (a sketch of this process follows the dataset list below).

Product dataset: We built a Q/A dataset over an open-source Instacart product catalog. The dataset is available to use at Huggingface. The code for creating the Q/A pairs is open-source https://github.com/lamini-ai/instacart-greg.
ICD-11 dataset: We obtained the ICD-11 standard by scraping the ICD-11 database. The dataset is available to use at Huggingface. The code for creating the Q/A pairs is open-source https://github.com/lamini-ai/lamini-sdk.
Earnings-call dataset: We obtained the earnings call transcripts by scraping the web. The dataset is available to use at Huggingface. The code for creating the Q/A pairs is open-source https://github.com/lamini-ai/lamini-sdk.
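As an illustration of the Q/A-pair preparation mentioned above, the sketch below turns a hypothetical catalog row into one instruction-tuning example. The field names are assumptions; the linked lamini-ai repositories contain the actual dataset-generation code.

def make_qa_pair(product: dict) -> dict:
    # One instruction-tuning example per catalog row.
    question = f"What is the product ID of {product['product_name']}?"
    answer = (f"The product ID of {product['product_name']} "
              f"is {product['product_id']}.")
    return {"question": question, "answer": answer}

row = {"product_id": "123456", "product_name": "Organic Egg Whites"}
print(make_qa_pair(row))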

Reproducibility

You can clone the sample repo by following the steps provided here. You'll need to change the environment variables in src/env to add your API keys. To run only the evaluation suite locally, run the following command:

./run-adhoc.sh LOCAL_MODEL_NAME="<any huggingface/Lamini finetuned model/openai model you want to try>"

We're hiring!

Join us to invent and build the world’s largest LLM training system!

Full-Stack Software Engineer:
https://jobs.lever.co/laminiai/4c6a40a5-9688-4f6d-bd65-692d139e5a5a

High Performance Computing (Triton + MPI) Engineer:
https://jobs.lever.co/laminiai/af688bf8-6c6e-42b5-87aa-0ee9afccdced

More roles:
https://jobs.lever.co/laminiai


Published on March 27, 2024