Regression Testing for LLMs

MLOps Series LLM Evaluation & Safety

Traditional software regression testing relies on deterministic input–output pairs: the same function, given the same arguments, should always produce the same result. Large language models shatter that assumption. Every prompt can yield slightly different wording, structure, or even factual content across runs. Despite this non-determinism, teams shipping LLM-powered features still need confidence that a model update, prompt change, or infrastructure migration does not silently degrade the user experience.

This post lays out a battle-tested framework for LLM regression testing — from golden-set design and semantic comparison to statistical gates that block bad deployments automatically.

Why Regression Testing for LLMs

LLMs are updated frequently — fine-tuning on new data, swapping base models, adjusting system prompts, or migrating to cheaper inference endpoints. Any of these changes can introduce subtle regressions that are invisible to unit tests but painfully obvious to end users. Unlike a compiler bug that crashes immediately, an LLM regression might manifest as slightly worse summaries, hallucinated facts, or broken JSON output that only appears on 5% of requests.

Key insight: LLM regression testing is not about checking for exact string matches. It is about ensuring that the semantic quality, format compliance, and safety properties of outputs remain stable (or improve) across model versions.

Common regression triggers include:

Model swaps — moving from GPT-4 to GPT-4-turbo, or from one fine-tuned checkpoint to another
Prompt engineering changes — even a single word change in a system prompt can flip behaviour
Infrastructure updates — quantisation, new serving framework, different batch sizes
Data drift — user inputs evolving outside the distribution the model was tuned for

Designing Test Suites

A robust LLM regression suite is built from three complementary layers: golden sets for critical paths, edge-case banks for adversarial robustness, and capability-specific probes for measuring targeted skills.

Golden Sets

Golden sets contain curated input–output pairs representing the most important use cases. Each entry includes the prompt, an ideal reference answer, and grading criteria (rubric or metric thresholds). Aim for 200–500 entries that cover the full taxonomy of your product's interactions.

# golden_set.py — Define and manage golden test cases import json from dataclasses import dataclass, field from typing import List, Dict, Optional @dataclass class GoldenCase: case_id: str prompt: str reference_answer: str category: str # e.g. "summarization", "qa", "code_gen" grading: Dict[str, float] # metric_name → threshold tags: List[str] = field(default_factory=list) metadata: Optional[Dict] = None def load_golden_set(path: str) -> List[GoldenCase]: with open(path) as f: raw = json.load(f) return [GoldenCase(**r) for r in raw] def filter_by_category(cases: List[GoldenCase], cat: str): return [c for c in cases if c.category == cat] # Example golden entry (JSON) # { # "case_id": "sum-017", # "prompt": "Summarize the following earnings call transcript...", # "reference_answer": "Revenue grew 12% YoY driven by...", # "category": "summarization", # "grading": {"rouge_l": 0.45, "semantic_sim": 0.82, "factual_precision": 0.90}, # "tags": ["finance", "long-context"] # }

Edge Cases & Adversarial Inputs

Edge-case banks probe failure modes that golden sets are not designed to catch. These include extremely long inputs, multilingual prompts, injection attempts, ambiguous queries, empty inputs, and inputs with unusual Unicode characters. For each edge case, define the acceptable behaviour (e.g., the model should refuse politely, not hallucinate).

Capability-Specific Probes

Rather than a single aggregate score, measure individual capabilities: reasoning, instruction following, format compliance, factuality, and safety. Each probe set isolates one skill so that a regression in summarisation quality does not get masked by improvements in code generation.

Golden Sets

200–500 curated pairs
Cover critical product paths
Human-verified reference answers
Multi-metric grading rubrics

Edge-Case Banks

50–200 adversarial inputs
Boundary conditions & injections
Safety & refusal behaviour
Binary pass/fail grading

Version Comparison Methodology

Comparing two model versions is more nuanced than diffing two log files. LLM outputs are stochastic, so a single run tells you almost nothing. The methodology below handles non-determinism through repeated sampling and semantic similarity.

Handling Non-Deterministic Outputs

For each test case, generate N responses (typically N = 5–10) from both the baseline and candidate models using the same temperature. Aggregate the scores per case before comparing versions. This converts noisy per-sample metrics into stable per-case distributions.

# version_compare.py — A/B comparison between two model versions import numpy as np from sentence_transformers import SentenceTransformer from sklearn.metrics.pairwise import cosine_similarity embedder = SentenceTransformer("all-MiniLM-L6-v2") def semantic_similarity(text_a: str, text_b: str) -> float: """Cosine similarity between sentence embeddings.""" emb = embedder.encode([text_a, text_b]) return float(cosine_similarity([emb[0]], [emb[1]])[0][0]) def run_comparison(golden_cases, model_a, model_b, n_samples=5): """Run both models on every golden case N times, return per-case metrics.""" results = [] for case in golden_cases: scores_a, scores_b = [], [] for _ in range(n_samples): out_a = model_a.generate(case.prompt) out_b = model_b.generate(case.prompt) scores_a.append(semantic_similarity(case.reference_answer, out_a)) scores_b.append(semantic_similarity(case.reference_answer, out_b)) results.append({ "case_id": case.case_id, "category": case.category, "mean_a": np.mean(scores_a), "mean_b": np.mean(scores_b), "std_a": np.std(scores_a), "std_b": np.std(scores_b), "delta": np.mean(scores_b) - np.mean(scores_a), }) return results

Semantic Similarity for Comparison

Exact string matching is useless for free-text outputs. Instead, compare model outputs against references using embedding-based cosine similarity (fast, good for topical alignment), ROUGE / BERTScore (captures token overlap plus semantics), and LLM-as-a-judge (a stronger model grades the candidate's output on a rubric). Layer these metrics: use embedding similarity as a fast filter and LLM-as-a-judge for borderline cases.

Warning: Relying solely on embedding cosine similarity can miss subtle factual errors. Two sentences can be semantically very close (cosine > 0.95) yet contradict each other on a critical detail. Always pair semantic metrics with a factuality check for high-stakes applications.

A/B Testing for Models

Beyond offline test suites, run live A/B tests where a small percentage of production traffic is routed to the candidate model. Capture user-facing metrics — click-through rate, task-completion rate, explicit thumbs-up/down — alongside automated quality scores. The offline suite gates deployment; the A/B test validates real-world impact.

Statistical Significance Testing

Eyeballing mean scores across versions is dangerous. A 0.5% drop might be noise, or it might be a real regression affecting thousands of users. Statistical testing quantifies that uncertainty.

Confidence Intervals & Paired Tests

Because both models are evaluated on the same set of test cases, use a paired test (e.g., paired t-test or Wilcoxon signed-rank) rather than an independent two-sample test. This dramatically increases statistical power by controlling for case-level variance.

# significance.py — Statistical testing for model comparison import numpy as np from scipy import stats def paired_significance_test(scores_a, scores_b, alpha=0.05): """Paired t-test with confidence interval on mean difference.""" diffs = np.array(scores_b) - np.array(scores_a) n = len(diffs) mean_diff = np.mean(diffs) se = np.std(diffs, ddof=1) / np.sqrt(n) # 95% confidence interval t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1) ci_low = mean_diff - t_crit * se ci_high = mean_diff + t_crit * se # Two-sided paired t-test t_stat, p_value = stats.ttest_rel(scores_b, scores_a) return { "mean_diff": mean_diff, "ci_95": (ci_low, ci_high), "t_stat": t_stat, "p_value": p_value, "significant": p_value < alpha, "direction": "improved" if mean_diff > 0 else "regressed", } def bootstrap_confidence_interval(scores_a, scores_b, n_boot=10000, alpha=0.05): """Bootstrap CI for mean score difference — non-parametric alternative.""" diffs = np.array(scores_b) - np.array(scores_a) boot_means = [] for _ in range(n_boot): sample = np.random.choice(diffs, size=len(diffs), replace=True) boot_means.append(np.mean(sample)) lower = np.percentile(boot_means, 100 * alpha / 2) upper = np.percentile(boot_means, 100 * (1 - alpha / 2)) return { "mean_diff": np.mean(diffs), "bootstrap_ci": (float(lower), float(upper)), "regressed": upper < 0, # entire CI below zero → significant regression } # Usage result = paired_significance_test(baseline_scores, candidate_scores) if result["significant"] and result["direction"] == "regressed": print("⚠️ Statistically significant regression detected!") print(f" Mean Δ = {result['mean_diff']:.4f}, p = {result['p_value']:.4f}") print(f" 95% CI: [{result['ci_95'][0]:.4f}, {result['ci_95'][1]:.4f}]")

Rule of thumb: With N = 300 golden cases, a paired t-test can detect a 0.02 point shift in semantic similarity at 80% power (α = 0.05). Fewer cases require larger effect sizes to reach significance — plan your golden set size accordingly.

Parametric Tests

Paired t-test — fast, closed-form
Assumes roughly normal differences
Works well with N > 30 (CLT)
Easy to compute confidence intervals

Non-Parametric Tests

Wilcoxon signed-rank — rank-based
Bootstrap — no distributional assumptions
Better for skewed or bounded metrics
More robust to outliers

Degradation Detection & Alerting

Statistical tests answer "is the difference real?" but production systems also need thresholds and alerts that translate statistical results into actionable deployment decisions.

Multi-Level Threshold Strategy

Define three severity tiers. A hard gate blocks deployment if any critical metric drops below an absolute threshold (e.g., safety refusal accuracy < 98%). A soft gate warns on statistically significant regressions exceeding a relative delta (e.g., > 2% drop in ROUGE-L). An advisory flags any non-significant downward trend for human review.

# degradation.py — Multi-level degradation detection from dataclasses import dataclass from typing import List, Dict from enum import Enum class Severity(Enum): PASS = "pass" ADVISORY = "advisory" SOFT_FAIL = "soft_fail" HARD_FAIL = "hard_fail" @dataclass class MetricThreshold: metric_name: str hard_min: float # absolute floor — hard gate soft_delta_pct: float # max allowed relative drop — soft gate category: str = "global" def evaluate_regression( baseline: Dict[str, float], candidate: Dict[str, float], thresholds: List[MetricThreshold], sig_results: Dict[str, Dict], ) -> List[Dict]: """Evaluate candidate against thresholds and statistical tests.""" verdicts = [] for th in thresholds: m = th.metric_name val_b, val_c = baseline[m], candidate[m] severity = Severity.PASS # Hard gate: absolute floor if val_c < th.hard_min: severity = Severity.HARD_FAIL # Soft gate: relative drop + statistical significance elif val_b > 0: delta_pct = (val_c - val_b) / val_b * 100 sig = sig_results.get(m, {}).get("significant", False) if delta_pct < -th.soft_delta_pct and sig: severity = Severity.SOFT_FAIL elif delta_pct < 0: severity = Severity.ADVISORY verdicts.append({"metric": m, "severity": severity, "baseline": val_b, "candidate": val_c}) return verdicts # Example threshold configuration thresholds = [ MetricThreshold("safety_refusal_acc", hard_min=0.98, soft_delta_pct=1.0), MetricThreshold("semantic_similarity", hard_min=0.75, soft_delta_pct=2.0), MetricThreshold("rouge_l", hard_min=0.30, soft_delta_pct=3.0), MetricThreshold("format_compliance", hard_min=0.95, soft_delta_pct=1.5), ]

Warning: Do not set soft-gate thresholds too tight. LLM scores have inherent variance, and overly aggressive gates cause a flood of false alarms that erode trust in the system. Calibrate thresholds against historical score distributions.

Alerting Pipeline

Wire severity levels into your alerting stack. HARD_FAIL triggers a PagerDuty incident and auto-blocks the deployment pipeline. SOFT_FAIL posts to a Slack channel and opens a review ticket. ADVISORY logs to a dashboard for weekly triage. This layered approach prevents alert fatigue while ensuring critical regressions never reach production.

Automation & CI Integration

The ultimate goal is a fully automated regression gate embedded in your CI/CD pipeline. Every model or prompt change triggers the test suite, runs statistical analysis, and produces a pass/fail verdict — no human in the loop for routine updates, human review only for borderline results.

# ci_regression.py — Automated regression gate for CI/CD import json, sys, os from pathlib import Path def load_baseline_scores(artifact_path: str) -> dict: """Load scores from the last blessed model version.""" with open(artifact_path) as f: return json.load(f) def run_regression_gate(baseline_path: str, candidate_path: str, config_path: str): baseline = load_baseline_scores(baseline_path) candidate = load_baseline_scores(candidate_path) config = json.load(open(config_path)) # Run statistical tests per metric sig_results = {} for metric in config["metrics"]: sig_results[metric] = paired_significance_test( baseline["per_case"][metric], candidate["per_case"][metric], ) # Evaluate thresholds verdicts = evaluate_regression( baseline["aggregated"], candidate["aggregated"], [MetricThreshold(**t) for t in config["thresholds"]], sig_results, ) # Determine overall gate status hard_fails = [v for v in verdicts if v["severity"] == Severity.HARD_FAIL] soft_fails = [v for v in verdicts if v["severity"] == Severity.SOFT_FAIL] # Write results for CI artifact report = { "status": "BLOCKED" if hard_fails else "WARN" if soft_fails else "PASSED", "verdicts": verdicts, "sig_results": sig_results, } Path("regression_report.json").write_text(json.dumps(report, indent=2, default=str)) if hard_fails: print("❌ HARD FAIL — deployment blocked") for v in hard_fails: print(f" {v['metric']}: {v['candidate']:.4f} < {v['baseline']:.4f}") sys.exit(1) elif soft_fails: print("⚠️ SOFT FAIL — review required") sys.exit(0) # CI passes but Slack alert fires else: print("✅ All regression checks passed") sys.exit(0) if __name__ == "__main__": run_regression_gate( baseline_path=os.environ["BASELINE_SCORES"], candidate_path=os.environ["CANDIDATE_SCORES"], config_path="regression_config.json", )

CI tip: Store baseline scores as a versioned artifact (e.g., in S3 or MLflow). Each successful deployment updates the baseline. This way, comparisons are always between the last blessed version and the current candidate — not some arbitrary historical snapshot.

Automated Regression Gates in Practice

Structure your CI job as three stages: Evaluate (run the candidate model on the test suite), Analyse (run statistical tests and threshold checks), and Gate (emit a pass/fail exit code). Parallelise the evaluation stage across test-case categories using CI matrix jobs for faster turnaround. Cache embeddings and reference scores to avoid redundant computation.

Dashboard Design & Reporting

Numbers in a JSON file do not drive organisational decisions — dashboards do. A well-designed regression dashboard provides at-a-glance health of every model version and surfaces trends before they become incidents.

Essential Dashboard Panels

Version timeline — aggregate quality scores plotted over model versions, with confidence bands. Spot slow degradation trends.
Per-category breakdown — heatmap of metric scores across capability categories (summarisation, QA, code, safety). Pinpoint exactly which skill regressed.
Head-to-head diff — for any two versions, display side-by-side outputs on the cases with the largest score deltas. Essential for root-cause analysis.
Alert history — timeline of HARD_FAIL and SOFT_FAIL events with links to the triggering commit and regression report.

Quick Metrics Panel

Overall pass/fail badge per version
Aggregate scores with CI bands
Δ from previous blessed version
Number of hard/soft/advisory flags

Deep-Dive Panel

Per-case score distributions
Worst-regressed examples with outputs
Statistical test details (p-values, CIs)
Links to CI runs and artifacts

Build the dashboard on top of your regression report artifacts using tools like Streamlit, Grafana, or a custom React app. The key design principle: the default view should answer "is the latest version safe to ship?" in under five seconds, with drill-down paths for deeper investigation.

Putting it all together: A mature LLM regression testing pipeline combines golden sets + edge cases (breadth), repeated sampling + semantic metrics (handling non-determinism), paired statistical tests (rigour), multi-level thresholds (actionability), CI automation (speed), and dashboards (visibility). Start with a small golden set and a single metric, then expand iteratively as your team gains confidence in the framework.