LLM Evaluation Metrics

As organizations adopt Large Language Models (LLMs) across customer service, marketing, research, and product development, rigorous evaluation of these models becomes a business-critical capability. Poorly evaluated models can lead to misleading outputs, legal liabilities, and damaged user trust. Stakeholders ranging from data scientists and ML engineers to product managers and compliance teams need to understand how to measure LLM performance, reliability, and fitness for production.

This post dives deep into the most widely used LLM evaluation metrics: what they measure, how they work, where they fall short, and when to use each.


1. Perplexity

What It Is

Perplexity is a standard metric in language modeling that quantifies how well a language model (LM) predicts a sequence of tokens. In simple terms, it measures how “confused” the model is when generating text: the lower the perplexity, the better the model is at predicting what comes next.

Intuition

If a model assigns high probability to the correct next token, it means the model is confident and not “perplexed”, resulting in low perplexity. Conversely, if the model spreads its probability mass across many wrong options, perplexity will be high.

You can think of it like this: a perplexity of k roughly means the model is, on average, as uncertain as if it had to choose uniformly among k equally likely next tokens.

Formula

\[\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i)}\]

where N is the number of tokens in the test set and P(x_i) is the probability the model assigns to the i-th token given the preceding tokens.

The base 2 logarithm means perplexity is expressed in terms of bits, as in “how many bits of uncertainty” the model has.

How to Test It

  1. Choose a held-out test set of tokenized text.
  2. Use your trained language model to calculate the probability of each token in the sequence.
  3. Apply the formula above to compute the perplexity score.

Popular libraries like HuggingFace Transformers and OpenLM provide built-in utilities to compute perplexity.
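
To make the recipe concrete, here is a minimal sketch (one way among many) that computes perplexity for a single snippet with HuggingFace Transformers; the model name and example text are placeholders for your own checkpoint and corpus.

```python
# Minimal sketch: perplexity of one text snippet with HuggingFace Transformers.
# "gpt2" and the example text are placeholders; swap in your own model/corpus.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Our internal knowledge base covers deployment and rollback procedures."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels returns the average cross-entropy loss (natural log);
    # exp(loss) matches the base-2 formula above because the change of base cancels.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```

For a real corpus you would average the loss over all tokens in the held-out set (ideally with a sliding window over long documents) rather than over a single snippet.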

Business Example

Imagine you’re a product manager evaluating which LLM to fine-tune for internal knowledge search. You compute perplexity on your company’s corpus using two candidate models.

Suppose Claude comes back with the lower perplexity: that means it is better at modeling your internal documents and will likely produce more fluent and relevant completions.

Limitations

When to Use

2. Exact Match (EM)

What It Is

Exact Match (EM) is one of the simplest yet most stringent metrics used in evaluating language model outputs. It checks whether the predicted output matches the reference (ground truth) exactly. If they match perfectly, the score is 1; otherwise, it is 0.

Intuition

Imagine asking a model: “What is the capital of France?” If the model responds with “Paris,” that’s a perfect match. If it says “The capital of France is Paris,” or just “paris” (lowercase), that may still be correct in meaning, but EM will give it a 0 if the formatting isn’t identical.

Thus, EM is ideal when you require precision and can’t tolerate variation in wording or structure.

Formula

\[EM = \frac{\sum_{i=1}^{N} \mathbb{1}[y_i = \hat{y}_i]}{N}\]

where N is the number of evaluation examples, y_i is the reference answer, ŷ_i is the model’s prediction, and the indicator term equals 1 when the two strings are identical and 0 otherwise.

This formula counts how many predictions are exactly correct, then divides by the total number of predictions.

How to Test It

  1. Prepare a set of ground truth answers for your task.
  2. Generate model predictions for the same inputs.
  3. Apply the EM formula by comparing each prediction to its corresponding ground truth.

You may also want to normalize the text before comparison (e.g., remove punctuation, lowercase, strip whitespace) depending on your use case.
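
As a rough illustration of steps 1–3, here is a small sketch of EM with light normalization; the normalizer is an assumption you should adapt to your own task.

```python
# Minimal sketch: Exact Match with light normalization (lowercase, strip
# punctuation and surrounding whitespace). Adjust normalization to your task.
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def exact_match(predictions, references) -> float:
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)

preds = ["Paris", "INV-20394"]
refs  = ["paris", "INV-20395"]
print(exact_match(preds, refs))  # 0.5: the invoice number does not match
```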

Business Example

📨 Invoice Processing: A company builds an LLM to extract invoice numbers from customer emails. Since invoice numbers must match exactly (e.g., “INV-20394”), EM is used to measure how often the model extracts the exact string correctly.

Limitations

When to Use

3. BLEU / ROUGE / METEOR

What They Are

These are n-gram overlap metrics widely used to evaluate text generation tasks such as machine translation, summarization, and text rewriting.

Intuition

Imagine the model’s output is a guess and the reference is the gold answer. BLEU asks how much of the guess appears in the gold answer (precision-oriented), ROUGE asks how much of the gold answer is covered by the guess (recall-oriented), and METEOR additionally gives credit for synonyms, stems, and sensible word order.

BLEU Formula

\[BLEU = BP \cdot \exp\left( \sum_{n=1}^N w_n \log p_n \right)\]

where BP is the brevity penalty (which penalizes candidates shorter than the reference), w_n is the weight for n-gram order n (typically uniform, w_n = 1/N), and p_n is the modified n-gram precision for order n.

ROUGE Formula (ROUGE-N)

\[ROUGE\text{-}N = \frac{\sum_{S \in \text{Reference}} \sum_{gram_n \in S} \text{Count}_{match}(gram_n)}{\sum_{S \in \text{Reference}} \sum_{gram_n \in S} \text{Count}(gram_n)}\]

where Count_match(gram_n) is the number of n-grams in the reference that also appear in the candidate, and Count(gram_n) is the total number of n-grams in the reference.

METEOR Formula (Simplified)

\[METEOR = F_{mean} \cdot (1 - Penalty)\]

where F_mean is the harmonic mean of unigram precision and recall (weighted toward recall), and Penalty grows with the number of non-contiguous matched chunks, punishing scrambled word order.

How to Test It

  1. Generate model output for a test set.
  2. Compare each output to one or more reference texts.
  3. Compute overlapping n-grams at different sizes (1-gram to 4-gram).
  4. Use libraries like sacrebleu, nltk.translate, or evaluate (from HuggingFace) to calculate BLEU, ROUGE, and METEOR scores.
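
A minimal sketch using the HuggingFace evaluate library is shown below; it assumes the metric backends (e.g., nltk for METEOR, rouge_score for ROUGE, sacrebleu for BLEU) are installed, and the sentences are purely illustrative.

```python
# Minimal sketch: BLEU, ROUGE, and METEOR via HuggingFace `evaluate`.
# Assumes the metric backends (nltk, rouge_score, sacrebleu) are installed.
import evaluate

predictions = ["the quick brown fox jumps over the lazy dog"]
references = ["a quick brown fox jumped over the lazy dog"]

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)

print(f"BLEU:    {bleu['bleu']:.3f}")
print(f"ROUGE-L: {rouge['rougeL']:.3f}")
print(f"METEOR:  {meteor['meteor']:.3f}")
```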

Business Example

🛍️ E-commerce Product Descriptions: A retailer uses LLMs to auto-generate product descriptions. To evaluate fluency and informativeness, the team compares each generated description against copywriter-written reference descriptions and tracks BLEU, ROUGE, and METEOR scores across the catalog.

Limitations

When to Use

4. BERTScore

What It Is

BERTScore is a metric that evaluates the quality of generated text by measuring semantic similarity between the candidate output and the reference using pre-trained contextual embeddings (usually from BERT or RoBERTa).

Unlike traditional n-gram overlap metrics (like BLEU or ROUGE), BERTScore can detect when the model’s output has the same meaning as the reference even if the words are different.

Intuition

Instead of looking for exact word matches, BERTScore embeds every word in the candidate and reference into a high-dimensional space using a pre-trained language model. It then computes how close these embeddings are, word by word, using cosine similarity. The closer the match, the better the semantic alignment.

Formula

\[\text{BERTScore} = \frac{1}{|\hat{y}|} \sum_{\hat{w} \in \hat{y}} \max_{w \in y} \text{cosine}_{sim}(\text{embed}(\hat{w}), \text{embed}(w))\]

Symbols Explained: ŷ is the candidate output and y is the reference; embed(·) maps each token to its contextual embedding, and cosine_sim measures how similar two embeddings are. The formula shown is the precision variant (it averages over candidate tokens); recall swaps the roles of candidate and reference, and the commonly reported BERTScore F1 combines the two.

How to Test It

  1. Tokenize and embed both the candidate and reference sentences using a BERT-based model.
  2. Compute cosine similarity between each token in the candidate and every token in the reference.
  3. For each candidate token, select the maximum similarity score.
  4. Average these maximum scores to get the final BERTScore.

You can use the bert-score Python package (https://github.com/Tiiiger/bert_score) for easy implementation.
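
A minimal usage sketch of that package is shown below; the candidate/reference pair is illustrative.

```python
# Minimal sketch: BERTScore with the `bert-score` package (pip install bert-score).
from bert_score import score

candidates = ["Go to Account Settings to reset your password."]
references = ["You can reset your password from the Account Settings page."]

# Returns per-pair precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean().item():.3f}  R={R.mean().item():.3f}  F1={F1.mean().item():.3f}")
```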

Business Example

💬 Customer Support QA: A company builds an LLM-based assistant to answer customer questions using internal documents. The reference answer might read “You can reset your password from the Account Settings page,” while the assistant replies “Go to Account Settings to reset your password.”

These answers are semantically the same, though not word-for-word matches. BERTScore recognizes this alignment, while BLEU or ROUGE may penalize the variation.

Limitations

When to Use

5. Human Judgment

What It Is

Human judgment refers to evaluating language model outputs using human annotators who assess the quality of generated responses along various subjective dimensions, such as fluency, coherence, relevance, factual correctness, helpfulness, tone, and safety.

This is considered the gold standard for evaluating open-ended generative tasks.

Intuition

Unlike automated metrics that compare outputs numerically, human judgment captures qualitative nuances. This includes subtle errors, logical coherence, and appropriateness that machines might miss. It enables real-world usability evaluation.

How to Test It

There are several common setups for human evaluations:

  1. Likert Scale Rating:

    • Annotators rate outputs on a fixed scale (e.g., 1–5 or 1–7).
    • Example: “Rate this summary for informativeness.”
  2. Pairwise Comparison:

    • Annotators are shown two outputs for the same input and asked: “Which is better?”
    • Often used in A/B testing to compare different models or prompting techniques.
  3. Ranking or Point Allocation:

    • Annotators rank multiple outputs or assign points proportionally.
  4. Task-Specific Criteria:

    • Create rubrics tailored to domain needs (e.g., legal clarity, medical safety).
  5. User Studies:

    • Evaluate with end users in real settings; observe satisfaction or task success.
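
Once ratings are collected, the aggregation itself is simple. The sketch below (with made-up annotator data) shows how Likert scores and pairwise votes are typically rolled up.

```python
# Minimal sketch: aggregating human Likert ratings and pairwise votes.
# The annotator data below is hypothetical.
from collections import Counter
from statistics import mean

likert_ratings = {          # 1-5 ratings collected per model
    "model_a": [4, 5, 3, 4],
    "model_b": [3, 3, 4, 2],
}
pairwise_votes = ["A", "A", "B", "A", "A"]   # answers to "Which is better?"

for model, scores in likert_ratings.items():
    print(f"{model}: mean rating {mean(scores):.2f}")

wins = Counter(pairwise_votes)
print(f"Model A win rate: {wins['A'] / len(pairwise_votes):.0%}")
```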

Business Example

⚖️ Legal Tech Application: A firm is developing an AI tool to generate legal clause summaries. They run a blind study in which 5 legal professionals rate the generated summaries for accuracy, clarity, and legal soundness without knowing which system produced each one.

Limitations

When to Use

Best Practices

6. LLM-as-a-Judge

What It Is

LLM-as-a-Judge is a scalable, automated method for evaluating language model outputs using another large language model (LLM) to score, rate, or compare candidate outputs. The evaluator LLM is prompted to act as a reviewer or critic, assessing the quality of model responses across various dimensions like correctness, fluency, and helpfulness.

This is particularly useful in fast iteration cycles and large-scale experiments where human evaluation would be too slow or costly.

Intuition

Instead of relying on human annotators, you ask a trusted LLM (e.g., GPT-4, Claude, or Gemini) to act like a judge: is this answer correct, is it helpful, and which of two candidate responses is better?

This approach leverages the evaluator model’s understanding of language, reasoning, and task alignment to mimic human judgment.

How to Test It

There are multiple setups depending on your needs:

1. Scoring (Point-wise): the judge model assigns a numeric score (e.g., 1–10) to a single response, optionally per dimension such as correctness, fluency, or helpfulness.

2. Pairwise Comparison (Relative Ranking): the judge sees two responses to the same input and picks the better one, yielding win rates or Elo-style rankings.

3. Rubric-Based Evaluation: the judge scores responses against an explicit rubric (criteria, weights, and examples) tailored to your task.

You can use templates or frameworks like OpenAI’s Evals, LMSYS’s Chatbot Arena, or Anthropic’s preference modeling prompts.

Example Prompt Template

You are a helpful and fair assistant. Evaluate the following two responses to the user query. Pick the one that is more relevant, helpful, and correct.

User Query: {user_query}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with 'A' or 'B' and explain briefly.
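
Wired into an API, the template might be used as in the sketch below (OpenAI Python SDK; the model name, query, and responses are assumptions, and any capable judge model would work).

```python
# Minimal sketch: pairwise LLM-as-a-Judge call via the OpenAI Python SDK.
# Model name and example inputs are placeholders; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a helpful and fair assistant. Evaluate the following two responses to the user query. Pick the one that is more relevant, helpful, and correct.

User Query: {user_query}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with 'A' or 'B' and explain briefly."""

def judge(user_query: str, response_a: str, response_b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong judge model
        temperature=0,   # keep the verdict deterministic
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_query=user_query, response_a=response_a, response_b=response_b)}],
    )
    return completion.choices[0].message.content

print(judge(
    "How do I reset my VPN password?",
    "Contact IT.",
    "Open the Self-Service portal, choose 'VPN', and click 'Reset password'.",
))
```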

Business Example

💼 Enterprise Chatbot Evaluation: A tech company tests three different LLM providers for their internal support bot. Instead of manually reviewing thousands of outputs, they prompt a judge LLM to compare responses pairwise and tally win rates for each provider.

Limitations

When to Use

Best Practices

7. Span-Level F1

What It Is

Span-Level F1 measures how well a model extracts specific spans of text by combining precision and recall into a single score. It’s commonly used in tasks like Named Entity Recognition (NER), extractive Question Answering (QA), and information extraction.

Rather than checking if the entire sentence is correct, Span-Level F1 focuses on whether the specific parts of interest (spans) are correctly identified.

Intuition

You want your model to extract correct spans (like names, dates, or answer phrases). Precision tells you how many of the spans your model predicted are correct. Recall tells you how many of the correct spans were found. F1 balances the two: a high F1 means your model is both accurate and comprehensive.

Formula

\[F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

where

\[\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]

F1 ranges from 0 (no correct predictions) to 1 (perfect match).

How to Test It

  1. Annotate your dataset with span-level ground truth (e.g., using BIO tagging or character offsets).
  2. Run your model to extract spans from the input text.
  3. Compare predicted spans to ground truth: Count True Positives, False Positives, False Negatives.
  4. Compute precision, recall, and F1 using the formulas above.

Libraries like seqeval, scikit-learn, and HuggingFace evaluate can automate this.
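
For intuition, the sketch below computes strict span-level F1 directly from (start, end, label) tuples; real pipelines usually lean on seqeval or similar, and the example spans are made up.

```python
# Minimal sketch: strict span-level F1 (exact boundaries and label must match).
def span_f1(predicted, gold):
    pred, true = set(predicted), set(gold)
    tp = len(pred & true)                      # exactly matching spans
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold_spans = [(10, 27, "EMAIL"), (40, 52, "PHONE")]       # hypothetical annotations
pred_spans = [(10, 27, "EMAIL"), (60, 70, "ACCOUNT_ID")]  # hypothetical predictions
print(span_f1(pred_spans, gold_spans))  # 0.5: one correct, one missed, one spurious
```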

Business Example

🔐 PII Extraction in Customer Support: A company wants to automatically redact customer PII (like email addresses, phone numbers, and account IDs) from incoming emails. Span-Level F1 captures both how many of the predicted spans are genuine PII (precision) and how many true PII spans were caught (recall); missed spans are costly here, so recall is watched especially closely.

Limitations

When to Use

8. Faithfulness / Groundedness

What It Is

Faithfulness (also known as groundedness) measures whether a model’s generated output is factually supported by a given context or source material. It’s especially important in Retrieval-Augmented Generation (RAG) systems, where models are expected to generate answers based on retrieved documents.

A response is considered faithful if every factual claim it makes can be traced back to the provided context, and it adds no unsupported or contradictory information.

Intuition

Faithfulness goes beyond fluency or relevance. A fluent answer may sound good but still hallucinate facts. Faithfulness ensures that what the model says is verifiably grounded in the source documents.

How to Test It

There is no single formula, but here are common methods:

1. Human Annotation: reviewers check each claim in the output against the retrieved or provided source documents.

2. LLM-as-a-Reviewer: a second LLM is prompted to verify whether each claim is supported by the given context.

3. Binary / Scale-based Evaluation: each response is labeled supported/unsupported, or rated on a groundedness scale (e.g., 1–5).

4. Fact Matching or Evidence Tracing (if structured references exist): claims are matched against a knowledge base or traced to specific supporting passages.

Symbolic Representation (Approximate Heuristic)

While there’s no official formula, a conceptual score could be:

\[\text{Faithfulness Score} = \frac{\text{Number of supported claims}}{\text{Total factual claims in output}}\]

where a claim counts as supported only if it can be verified against the provided source material, and the denominator counts every factual claim the output makes.

This can be applied in automated or manual workflows.
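
As a tiny illustration, the sketch below turns per-claim support labels (produced by human reviewers or an LLM verifier) into the heuristic score above.

```python
# Minimal sketch: faithfulness score from per-claim support labels.
def faithfulness_score(claim_supported):
    """claim_supported: list of booleans, True if the claim is backed by the context."""
    if not claim_supported:
        return 1.0  # no factual claims, nothing to contradict
    return sum(claim_supported) / len(claim_supported)

print(faithfulness_score([True, True, False]))  # ~0.67: one unsupported claim
```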

Business Example

🏛️ Chatbot Compliance in Enterprise IT: A chatbot provides answers based on internal policy PDFs. Suppose the policy requires a formal ticket for every password reset, but the bot tells the user to simply email IT and ask for one.

This output is unfaithful and could cause a policy violation. Evaluating faithfulness helps ensure compliance and trust.

Limitations

When to Use

Best Practices

9. nDCG / MRR

What They Are

Both nDCG (normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank) are standard metrics for evaluating ranking quality, particularly useful for systems that return ranked lists, such as search engines, recommendation systems, and RAG retrievers.

Intuition

Imagine a search engine returns a list of 10 items. Even if all the right answers are there, placing the most relevant ones at the top is crucial. nDCG rewards systems that put the most useful results earlier in the list. MRR is simpler: it just cares about where the first correct answer is.

nDCG Formula

\[nDCG@k = \frac{DCG@k}{IDCG@k} \quad \text{where} \quad DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}\]

where rel_i is the relevance score of the item at rank i, DCG@k is the discounted cumulative gain of the top k returned results, and IDCG@k is the DCG of the ideal ranking (the best possible ordering of the same items).

nDCG ranges from 0 to 1, with 1 meaning perfect ranking.

MRR Formula

\[MRR = \frac{1}{N} \sum_{i=1}^N \frac{1}{rank_i}\]

where N is the number of queries and rank_i is the position of the first relevant result for query i.

If the correct result appears at rank 1, the score is 1. If it appears at rank 5, the score is 1/5. The MRR is the average across all queries.

How to Test It

  1. Use a dataset with ground-truth relevance labels for each query.
  2. For each query, have the model return a ranked list of items (documents, passages, FAQs, etc.).
  3. Assign relevance scores to each item in the list.
  4. Compute DCG, IDCG, and nDCG, and/or MRR based on where the first correct answer appears.

You can use libraries like scikit-learn, evaluate, or trec_eval.
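
The sketch below shows nDCG via scikit-learn and MRR computed by hand; the relevance labels and retriever scores are illustrative.

```python
# Minimal sketch: nDCG with scikit-learn and a hand-rolled MRR.
import numpy as np
from sklearn.metrics import ndcg_score

# One query, five retrieved items: ground-truth relevance vs. retriever scores.
true_relevance = np.asarray([[3, 0, 2, 0, 1]])
retriever_scores = np.asarray([[0.9, 0.8, 0.7, 0.3, 0.1]])
print("nDCG@5:", ndcg_score(true_relevance, retriever_scores, k=5))

def mean_reciprocal_rank(results_per_query):
    """Each inner list holds 0/1 relevance in the order the system returned items."""
    reciprocal_ranks = []
    for relevances in results_per_query:
        rr = 0.0
        for rank, rel in enumerate(relevances, start=1):
            if rel:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print("MRR:", mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.5
```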

Business Example

🔎 FAQ Retrieval in Customer Support: A business uses an LLM-backed search to return the top 5 most relevant FAQ articles for a customer query. nDCG@5 rewards rankings that surface the truly relevant articles first, while MRR tracks how far down the list the first helpful article appears.

Limitations

When to Use

10. Hallucination Rate

What It Is

Hallucination rate refers to the percentage of model outputs that contain factually incorrect, fabricated, or unsupported claims, particularly in contexts where the model is expected to generate outputs based on verifiable knowledge (e.g., from retrieved documents or structured databases).

This metric helps assess the factual reliability of generative models.

Intuition

Even if an output sounds fluent or well-structured, it may invent names, dates, citations, or facts that aren’t supported by any source. This is known as a hallucination. Tracking the hallucination rate helps quantify the risk of such errors in real-world deployments.

Approximate Formula

\[\text{Hallucination Rate} = \frac{\text{Number of hallucinated outputs}}{\text{Total number of evaluated outputs}} \times 100\%\]

where an output counts as hallucinated if it contains at least one claim that is fabricated or unsupported by the source or reference material.

The rate is typically expressed as a percentage.

How to Test It

  1. Manual Annotation (Gold Standard):

    • Human reviewers compare generated output with source/reference documents.
    • Each response is labeled as faithful or hallucinated.
  2. LLM-Based Fact-Checking:

    • Use a second LLM to identify and verify factual claims.
    • Prompt it to mark which claims are unsupported or false.
  3. Entity Matching / Fact Retrieval (Structured Data):

    • Compare outputs against known facts in databases (e.g., Wikidata, product catalogs).
  4. Scoring Granularity:

    • Binary (Yes/No)
    • Fraction of hallucinated sentences/claims per output
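
However the labels are produced, the final computation is simple; the sketch below assumes binary per-output labels.

```python
# Minimal sketch: hallucination rate from binary per-output labels
# (True = the output contains at least one fabricated/unsupported claim).
def hallucination_rate(labels):
    return 100.0 * sum(labels) / len(labels)

reviewed_outputs = [False, True, False, False, True]  # hypothetical review results
print(f"Hallucination rate: {hallucination_rate(reviewed_outputs):.0f}%")  # 40%
```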

Business Example

⚖️ Factual Quality in Legal Summarization: A law firm uses an LLM to summarize contracts. Some generated summaries invent obligations or clauses not found in the original document. Tracking the hallucination rate per model and per contract type tells the firm how often this happens and whether the tool is safe to deploy.

Limitations

When to Use

Best Practices

TL;DR: When to Use Each LLM Evaluation Metric

Below is a quick-reference table summarizing 10 essential LLM evaluation metrics and the ideal scenarios for their application. Use this table to guide your evaluation strategy across generative, extractive, and retrieval-augmented tasks.

| Metric | Best When To Use | Example Use Case |
| --- | --- | --- |
| Perplexity | Measuring fluency and model confidence during pretraining or fine-tuning | Comparing base model performance on internal corpora |
| Exact Match (EM) | Binary classification/extraction tasks with exact targets | Invoice number extraction |
| BLEU / ROUGE | Template-style generation (translation, summarization) | Product description generation |
| METEOR | Text generation with flexibility for synonyms and word order | Summarization evaluation allowing for lexical variation |
| BERTScore | Semantic equivalence matters more than exact wording | Paraphrase detection or answer alignment |
| Human Judgment | High-stakes or subjective evaluation requiring nuanced, contextual understanding | Legal summarization, creative content generation |
| LLM-as-a-Judge | Scalable comparisons or preference rankings during iteration | A/B testing two model outputs using GPT-4 as an evaluator |
| Span-Level F1 | Extractive tasks requiring structured span annotation | Named Entity Recognition, PII redaction |
| Faithfulness | RAG, policy-aligned, or document-grounded generation | Enterprise chatbot constrained by internal policy PDFs |
| nDCG / MRR | Ranked retrieval quality in search, recommendation, or RAG pipelines | FAQ or document retrieval for customer support |
| Hallucination Rate | High-risk environments needing factual guarantees | Legal, healthcare, or financial summarization applications |

Notes:

Conclusion

LLM evaluation is a multi-dimensional task requiring both quantitative rigor and qualitative insight. No single metric suffices for all cases. The best practice is to combine multiple metrics, automated and human-driven, based on your application’s needs. As LLMs evolve, so must your evaluation strategy. Understand the tradeoffs, invest in tooling, and keep the feedback loop open between engineering, product, and compliance teams.

For further inquiries or collaboration, feel free to contact me by email.