Prompt Testing: Automated Evaluation
Prompts are code — brittle, version-sensitive, and user-facing. A single word change can flip a model from producing reliable JSON to hallucinating wildly. Yet most teams still evaluate prompts by "vibes": paste a few examples into a playground, eyeball the outputs, and ship. This works until it doesn't — a model update silently degrades your summarizer, a new edge case triggers toxic output, or a refactored system prompt breaks a downstream parser.
This post introduces a disciplined, automated approach to prompt testing: building deterministic test suites, curating golden datasets, detecting regressions across model versions, using LLMs to judge other LLMs, and wiring everything into CI/CD so every prompt change is gated by evidence.
Why Test Prompts
Traditional software testing asserts deterministic behavior: function f(x) always returns y. LLM outputs are stochastic, context-dependent, and sensitive to seemingly trivial changes. This makes testing harder — but also more important. Three forces make prompt testing non-negotiable in production:
Prompt Fragility
Research from Microsoft and others shows that reordering few-shot examples, changing a single delimiter, or swapping "Answer:" for "Response:" can shift accuracy by 5–20 percentage points on benchmarks. Prompts are the highest-leverage surface in an LLM application, and they are also the most fragile.
Model Update Risk
When a provider pushes a new model snapshot (e.g., gpt-4-0613 → gpt-4-1106-preview), your prompts run against a different function. Behavioral changes are undocumented, often subtle, and may not surface for weeks. Without automated tests, you are flying blind through every model update.
Regression at Scale
A team with 50+ prompts across agents, RAG pipelines, and classifiers cannot manually QA every change. Regression testing must be automated, fast, and integrated into the development workflow — just like unit tests for application code.
Building Prompt Test Suites
A prompt test suite is a structured collection of input → expected behavior pairs, with assertions that verify the model output meets criteria. Unlike unit tests, prompt tests rarely assert exact string equality. Instead, they check structural, semantic, and constraint-based properties.
Test Case Anatomy
Every test case should capture four elements:
- Input — The user message, context documents, or variables injected into the prompt template.
- Expected behavior — What the output must satisfy (format, content, constraints).
- Assertion type — How to evaluate (exact match, contains, regex, semantic similarity, LLM-judge).
- Metadata — Tags for category, priority, edge-case type, creation date.
```python
# prompt_test_suite.py — Structured test case definition
from dataclasses import dataclass, field
from typing import List
import json
import re

# Placeholder fixtures; in practice these would be loaded from files.
EARNINGS_CALL_TEXT = "..."
GOLDEN_SUMMARY = "..."


@dataclass
class Assertion:
    type: str  # "contains" | "regex" | "json_schema" | "semantic" | "llm_judge"
    expected: str
    threshold: float = 0.85  # for similarity-based assertions


@dataclass
class PromptTestCase:
    name: str
    prompt_template: str
    variables: dict
    assertions: List[Assertion]  # Assertion is defined first so this annotation resolves
    tags: List[str] = field(default_factory=list)
    priority: str = "medium"  # low | medium | high | critical


# Example test case for a summarization prompt
test_summarize = PromptTestCase(
    name="summarize_earnings_call",
    prompt_template="Summarize the following transcript in exactly 3 bullet points:\n\n{transcript}",
    variables={"transcript": EARNINGS_CALL_TEXT},
    assertions=[
        Assertion(type="regex", expected=r"^[-•]\s.+(\n[-•]\s.+){2}$"),  # exactly 3 bullets
        Assertion(type="contains", expected="revenue"),  # must mention revenue
        Assertion(type="semantic", expected=GOLDEN_SUMMARY, threshold=0.80),  # cosine sim
    ],
    tags=["summarization", "finance"],
    priority="critical",
)
```
Assertion Types
Deterministic Assertions
- exact_match — Output equals expected string
- contains — Output includes a substring
- not_contains — Output excludes banned phrases
- regex — Output matches a pattern
- json_schema — Output parses and validates against schema
- length — Token or character count within bounds
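The deterministic checks above are cheap to implement with the standard library. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, and `check_json_keys` is a lightweight stand-in for real schema validation (a dedicated validator such as the `jsonschema` package would be the usual choice for full JSON Schema support).

```python
import json
from typing import Dict, List


def check_exact_match(output: str, expected: str) -> bool:
    """Compare after trimming surrounding whitespace (an assumed convention)."""
    return output.strip() == expected.strip()


def check_not_contains(output: str, banned: List[str]) -> bool:
    """Pass only if none of the banned phrases appear, ignoring case."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned)


def check_length(output: str, min_chars: int, max_chars: int) -> bool:
    """Character-count bound; a token-based bound would need a tokenizer."""
    return min_chars <= len(output) <= max_chars


def check_json_keys(output: str, required: Dict[str, type]) -> bool:
    """Parse the output as JSON and check required keys and their types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )
```

Because these checks never call a model, they can run on every saved output at negligible cost, which makes them a natural first layer before any probabilistic assertion.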
Probabilistic Assertions
- semantic_similarity — Cosine similarity ≥ threshold
- llm_judge — A judge model scores on rubric
- classification — Output classified into expected label
- entailment — NLI model confirms output entails reference
- toxicity — Safety classifier score below threshold
- coherence — Perplexity or fluency score above threshold
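To make the `semantic_similarity` assertion concrete, here is a small sketch of a thresholded cosine check. It assumes a toy bag-of-words vector in place of a real embedding model so that it runs without dependencies; `bow_vector` and `semantic_assert` are hypothetical names, and word-count vectors capture far less meaning than learned embeddings.

```python
import math
import re
from collections import Counter
from typing import Tuple


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def semantic_assert(output: str, reference: str, threshold: float = 0.85) -> Tuple[bool, float]:
    """Return (passed, score) so the score can be logged next to the verdict."""
    score = cosine_similarity(bow_vector(output), bow_vector(reference))
    return score >= threshold, score
```

Returning the raw score alongside the pass/fail flag makes threshold tuning easier, since borderline cases can be inspected rather than guessed at.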
```python
# Test runner — executing and evaluating a suite
import re
from dataclasses import dataclass
from typing import List, Optional


# Minimal result containers assumed by the runner.
@dataclass
class AssertionResult:
    passed: bool
    score: Optional[float] = None


@dataclass
class TestResult:
    name: str
    passed: bool
    output: str
    assertion_details: List[AssertionResult]
    priority: str = "medium"  # copied from the test case so gates can filter on it


@dataclass
class TestReport:
    results: List[TestResult]


class PromptTestRunner:
    # PromptTestCase and Assertion come from the test case definitions above.
    def __init__(self, llm_client, embedding_model, judge_model=None):
        self.llm = llm_client
        self.embedder = embedding_model
        self.judge = judge_model

    def run_suite(self, suite: List[PromptTestCase]) -> TestReport:
        results = []
        for tc in suite:
            prompt = tc.prompt_template.format(**tc.variables)
            output = self.llm.generate(prompt, temperature=0)
            # Evaluate every assertion so the report shows all failures, not just the first.
            details = [self._evaluate(output, a) for a in tc.assertions]
            results.append(TestResult(
                name=tc.name,
                passed=all(d.passed for d in details),
                output=output,
                assertion_details=details,
                priority=tc.priority,
            ))
        return TestReport(results=results)

    def _evaluate(self, output: str, assertion: Assertion) -> AssertionResult:
        # _validate_json, _cosine_sim, and _judge_eval are helper methods not shown here.
        if assertion.type == "contains":
            return AssertionResult(passed=assertion.expected in output)
        elif assertion.type == "regex":
            return AssertionResult(passed=bool(re.search(assertion.expected, output)))
        elif assertion.type == "json_schema":
            return self._validate_json(output, assertion.expected)
        elif assertion.type == "semantic":
            score = self._cosine_sim(output, assertion.expected)
            return AssertionResult(passed=score >= assertion.threshold, score=score)
        elif assertion.type == "llm_judge":
            return self._judge_eval(output, assertion)
        raise ValueError(f"Unknown assertion type: {assertion.type}")
```
Run every test with temperature=0 (or a fixed seed when available) to maximize reproducibility. Save the raw output alongside pass/fail for debugging.
Golden Datasets
A golden dataset is a curated, versioned collection of input-output pairs that represent the expected behavior of your prompt across the full range of production scenarios. It serves as the ground truth for all automated evaluations.
Curation Principles
- Representative coverage — Include common cases, edge cases, adversarial inputs, and multilingual examples proportional to real traffic.
- Human-verified — Every golden output should be reviewed and approved by a domain expert, not just copied from model output.
- Versioned alongside prompts — Golden datasets live in version control next to the prompt templates they evaluate.
- Living documents — Update the golden set when you discover new failure modes in production.
```python
# golden_dataset.py — Schema and loading
import hashlib
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass
class GoldenExample:
    id: str
    input_text: str
    expected_output: str
    tags: List[str]
    category: str  # "common" | "edge_case" | "adversarial"
    verified_by: str  # human reviewer name
    created_at: str


class GoldenDataset:
    def __init__(self, path: str):
        self.path = Path(path)
        self.examples = self._load()
        self.version = self._compute_hash()

    def _load(self) -> List[GoldenExample]:
        data = json.loads(self.path.read_text())
        return [GoldenExample(**ex) for ex in data["examples"]]

    def coverage_report(self) -> Dict[str, int]:
        """Count examples per category for coverage auditing."""
        counts = {}
        for ex in self.examples:
            counts[ex.category] = counts.get(ex.category, 0) + 1
        return counts

    def filter_by_tag(self, tag: str) -> List[GoldenExample]:
        return [ex for ex in self.examples if tag in ex.tags]

    def _compute_hash(self) -> str:
        # Short content hash: any edit to the dataset file changes the version.
        content = self.path.read_bytes()
        return hashlib.sha256(content).hexdigest()[:12]
```
Coverage Metrics
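One way to turn per-category counts (such as those returned by `coverage_report`) into a coverage metric is to compare each category's share of the dataset against a minimum target. The sketch below assumes this share-based definition; the function names and the target values in the usage note are hypothetical, not figures from this post.

```python
from typing import Dict, List


def coverage_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category counts into fractions of the whole dataset."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


def coverage_gaps(counts: Dict[str, int], targets: Dict[str, float]) -> List[str]:
    """List categories whose share falls below the minimum target share."""
    shares = coverage_shares(counts)
    return sorted(
        category
        for category, minimum in targets.items()
        if shares.get(category, 0.0) < minimum
    )
```

For example, with counts of 70 common, 25 edge-case, and 5 adversarial examples and illustrative minimum shares of 0.5, 0.2, and 0.1, only the adversarial category is reported as a gap.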
Regression Testing
Regression testing detects when prompt changes or model updates degrade performance relative to a known baseline. The key challenge is that LLM outputs are stochastic — you cannot simply diff strings. Instead, you must compare aggregate metrics and establish statistical significance.
Baseline → Candidate Comparison
The regression testing workflow compares a baseline (last known-good prompt + model) against a candidate (the proposed change). For each golden example, run both and compare:
```python
# regression_tester.py — Compare baseline vs candidate
from dataclasses import dataclass
from typing import List

import numpy as np
from scipy import stats


@dataclass
class RegressionResult:
    passed: bool
    baseline_mean: float
    candidate_mean: float
    p_value: float
    degraded_examples: List[dict]


class RegressionTester:
    def __init__(self, runner: PromptTestRunner, golden: GoldenDataset):
        self.runner = runner
        self.golden = golden

    def compare(self, baseline_prompt: str, candidate_prompt: str,
                model: str, threshold: float = 0.02) -> RegressionResult:
        """Run both prompts on golden data and compare metrics."""
        baseline_scores = []
        candidate_scores = []
        for ex in self.golden.examples:
            # Assumes the client accepts a `model` argument, as the judge client does.
            b_out = self.runner.llm.generate(
                baseline_prompt.format(input=ex.input_text), model=model, temperature=0
            )
            c_out = self.runner.llm.generate(
                candidate_prompt.format(input=ex.input_text), model=model, temperature=0
            )
            # _score is a scoring helper not shown here (e.g., similarity to the reference).
            baseline_scores.append(self._score(b_out, ex.expected_output))
            candidate_scores.append(self._score(c_out, ex.expected_output))

        # Paired t-test for statistical significance
        t_stat, p_value = stats.ttest_rel(candidate_scores, baseline_scores)
        mean_diff = np.mean(candidate_scores) - np.mean(baseline_scores)

        # Fail only when the drop exceeds the threshold AND is statistically significant.
        passed = bool(mean_diff >= -threshold or p_value > 0.05)
        return RegressionResult(
            passed=passed,
            baseline_mean=float(np.mean(baseline_scores)),
            candidate_mean=float(np.mean(candidate_scores)),
            p_value=float(p_value),
            degraded_examples=self._find_degraded(baseline_scores, candidate_scores),
        )

    def _find_degraded(self, b_scores, c_scores, drop_threshold=0.15):
        """Flag individual examples where candidate is significantly worse."""
        degraded = []
        for i, (b, c) in enumerate(zip(b_scores, c_scores)):
            if b - c > drop_threshold:
                degraded.append({
                    "index": i,
                    "baseline_score": b,
                    "candidate_score": c,
                    "drop": round(b - c, 4),
                })
        return degraded
```
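A paired t-test assumes the per-example score differences are roughly normal, which may not hold for bounded or clustered scores. As an alternative, the sketch below swaps in a paired bootstrap confidence interval for the mean difference, using only the standard library. The function name, the resample count, and the fixed seed are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    candidate: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(candidate - baseline) on paired scores."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("score lists must be non-empty and the same length")
    # Work on per-example differences so the pairing is preserved.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    n = len(diffs)
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return resampled_means[lower_idx], resampled_means[upper_idx]
```

If the whole interval lies below zero, the candidate is worse than the baseline at the chosen confidence level; if the interval straddles zero, the data do not separate the two prompts.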
LLM-as-Judge Evaluation
Many prompt outputs — creative text, explanations, conversational responses — cannot be evaluated with deterministic assertions or embedding similarity alone. LLM-as-judge uses a separate (often stronger) model to score candidate outputs against a rubric, mimicking human evaluation at scale.
Single-Point vs. Pairwise Evaluation
Single-Point Grading
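The judge scores one output on an absolute rubric (e.g., 1–5 for helpfulness). Simple to implement but sensitive to score calibration: a judge's absolute scores can shift with rubric wording or judge model version, so a 4 in one run is not necessarily a 4 in another.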
Use when: You have one candidate output per input and need an absolute quality metric.
Pairwise Comparison
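The judge compares two outputs (A vs. B) and picks a winner. More robust to calibration drift, but susceptible to position bias (the judge may favor whichever output appears first), and it requires two candidate outputs per input, roughly doubling inference cost.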
Use when: Comparing a baseline prompt against a candidate, or ranking multiple prompt variants.
```python
# llm_judge.py — LLM-as-Judge evaluator
import json
import random
from typing import List

import numpy as np

JUDGE_RUBRIC_TEMPLATE = """You are an expert evaluator. Score the following output on a scale of 1-5 for each criterion.

[Input]: {input_text}
[Output]: {model_output}
[Reference]: {reference_output}

Criteria:
1. **Accuracy** — Are all facts correct and consistent with the reference?
2. **Completeness** — Does the output cover all key points?
3. **Conciseness** — Is the output free of unnecessary repetition?
4. **Format compliance** — Does the output follow the requested format?

Respond in JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "format": N}}
"""

PAIRWISE_TEMPLATE = """Compare these two outputs for the given input.

[Input]: {input_text}
[Output A]: {output_a}
[Output B]: {output_b}

Which output is better overall? Respond with exactly "A" or "B" and a one-sentence reason.
"""


class LLMJudge:
    def __init__(self, judge_client, model="gpt-4o"):
        self.client = judge_client
        self.model = model

    def grade_single(self, input_text, output, reference) -> dict:
        prompt = JUDGE_RUBRIC_TEMPLATE.format(
            input_text=input_text, model_output=output, reference_output=reference
        )
        response = self.client.generate(prompt, model=self.model, temperature=0)
        return json.loads(response)

    def pairwise_compare(self, input_text, output_a, output_b) -> str:
        # Randomize order to mitigate position bias
        swapped = random.random() > 0.5
        if swapped:
            output_a, output_b = output_b, output_a
        prompt = PAIRWISE_TEMPLATE.format(
            input_text=input_text, output_a=output_a, output_b=output_b
        )
        verdict = self.client.generate(prompt, model=self.model, temperature=0)
        # Read only the leading character; scanning a prefix for "A" would
        # misread replies such as "B. A is weaker".
        winner = "A" if verdict.strip().upper().startswith("A") else "B"
        # Correct for swap
        if swapped:
            winner = "B" if winner == "A" else "A"
        return winner

    def evaluate_suite(self, golden: GoldenDataset, outputs: List[str]) -> dict:
        scores = {"accuracy": [], "completeness": [], "conciseness": [], "format": []}
        for ex, out in zip(golden.examples, outputs):
            grade = self.grade_single(ex.input_text, out, ex.expected_output)
            for k in scores:
                scores[k].append(grade[k])
        return {k: round(float(np.mean(v)), 2) for k, v in scores.items()}
```
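Randomizing the order averages position bias out over many comparisons, but any single verdict can still be biased. A stricter variant, sketched below under assumed names, asks the judge twice with the order reversed and counts a win only when both verdicts agree. The `JudgeFn` signature is a hypothetical wrapper around a judge call that returns "A" or "B" for the first and second output shown.

```python
from typing import Callable

# (input_text, first_output, second_output) -> "A" if the first wins, else "B"
JudgeFn = Callable[[str, str, str], str]


def debiased_pairwise(judge: JudgeFn, input_text: str,
                      baseline: str, candidate: str) -> str:
    """Judge in both orders; disagreement between orders counts as a tie."""
    first = judge(input_text, baseline, candidate)   # baseline shown as A
    second = judge(input_text, candidate, baseline)  # candidate shown as A
    baseline_wins_first = first == "A"
    baseline_wins_second = second == "B"
    if baseline_wins_first and baseline_wins_second:
        return "baseline"
    if not baseline_wins_first and not baseline_wins_second:
        return "candidate"
    return "tie"  # verdict flipped with position, so treat it as unreliable
```

A judge that always answers "A" regardless of content produces a tie here instead of a spurious win, at the price of doubling the number of judge calls.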
CI/CD Integration
The final step is wiring prompt tests into your CI/CD pipeline so every prompt change — whether a template edit, model version bump, or golden dataset update — is automatically evaluated before reaching production. This turns prompt engineering from an ad-hoc craft into a software engineering discipline.
GitHub Actions Workflow
```yaml
# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - "prompts/**"
      - "golden_datasets/**"
      - "prompt_tests/**"
  schedule:
    - cron: "0 6 * * 1"  # Weekly model drift check

permissions:
  contents: read
  pull-requests: write  # needed to post the summary comment

jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements-prompt-tests.txt

      - name: Run prompt test suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m prompt_tests.runner \
            --suite prompt_tests/suites/ \
            --golden golden_datasets/ \
            --output results/report.json \
            --fail-on-regression

      - name: Upload test report
        if: always()  # keep the report even when the test step fails
        uses: actions/upload-artifact@v4
        with:
          name: prompt-test-report
          path: results/report.json

      - name: Post summary to PR
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('results/report.json', 'utf8'));
            const summary = [
              `## 🧪 Prompt Test Results`,
              `| Metric | Baseline | Candidate | Δ |`,
              `|--------|----------|-----------|---|`,
              ...report.metrics.map(m =>
                `| ${m.name} | ${m.baseline} | ${m.candidate} | ${m.delta} |`
              ),
              `\n**Verdict:** ${report.passed ? '✅ Pass' : '❌ Fail'}`
            ].join('\n');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: summary
            });
```
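The workflow invokes `python -m prompt_tests.runner` with four flags, but the entry point itself is not shown. The following is a rough sketch of how such an entry point might parse those flags and map the report to an exit code. Everything beyond the flag names is an assumption: the `evaluate` callback stands in for the real suite execution, and the report is assumed to be a dict with a boolean `passed` field, as the PR summary script expects.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Flags mirror the ones used in the workflow's run step."""
    parser = argparse.ArgumentParser(prog="prompt_tests.runner")
    parser.add_argument("--suite", required=True, help="Directory of test suites")
    parser.add_argument("--golden", required=True, help="Directory of golden datasets")
    parser.add_argument("--output", required=True, help="Path for the JSON report")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="Exit non-zero when the report did not pass")
    return parser


def exit_code(report: Dict, fail_on_regression: bool) -> int:
    """Non-zero only when the flag is set and the report failed."""
    if fail_on_regression and not report.get("passed", False):
        return 1
    return 0


def main(evaluate: Callable[[argparse.Namespace], Dict],
         argv: Optional[List[str]] = None) -> int:
    """Parse flags, run the (hypothetical) evaluation, write the report."""
    args = build_parser().parse_args(argv)
    report = evaluate(args)
    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Write the report before deciding the exit code so CI can upload it on failure.
    out.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return exit_code(report, args.fail_on_regression)
```

A real module would call `sys.exit(main(run_all_suites))` under an `if __name__ == "__main__":` guard, where `run_all_suites` is whatever function loads the suites and produces the report.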
Quality Gates
Define automated gates that block merges when prompts regress:
Hard Gates (Block Merge)
- Any critical test case fails
- Aggregate accuracy drops by more than 2 percentage points (0.02 on a 0–1 score scale)
- JSON schema validation failures > 0
- Toxicity score exceeds threshold
Soft Gates (Require Review)
- Non-critical test failures > 5%
- LLM judge scores decline on any axis
- New untested prompt templates detected
- Golden dataset coverage below target
```python
# gate_evaluator.py — Automated quality gates
from dataclasses import dataclass
from typing import List

# TestReport and RegressionResult are the result types produced by the test
# runner and regression tester. Each test result is assumed to expose
# `priority` and `passed`.


@dataclass
class GateResult:
    name: str
    gate_type: str  # "hard" | "soft"
    passed: bool
    message: str


class QualityGateEvaluator:
    def __init__(self, config: dict):
        self.config = config

    def evaluate(self, report: "TestReport",
                 regression: "RegressionResult") -> List[GateResult]:
        gates = []

        # Hard gate: critical test failures
        critical_failures = [r for r in report.results
                             if r.priority == "critical" and not r.passed]
        gates.append(GateResult(
            name="critical_tests",
            gate_type="hard",
            passed=len(critical_failures) == 0,
            message=f"{len(critical_failures)} critical test(s) failed",
        ))

        # Hard gate: accuracy regression
        max_drop = self.config.get("max_accuracy_drop", 0.02)
        accuracy_drop = regression.baseline_mean - regression.candidate_mean
        gates.append(GateResult(
            name="accuracy_regression",
            gate_type="hard",
            passed=accuracy_drop <= max_drop,
            message=f"Accuracy drop: {accuracy_drop:.3f} (max: {max_drop})",
        ))

        # Soft gate: non-critical failure rate (critical tests are covered above)
        non_critical = [r for r in report.results if r.priority != "critical"]
        failures = [r for r in non_critical if not r.passed]
        fail_rate = len(failures) / len(non_critical) if non_critical else 0.0
        gates.append(GateResult(
            name="non_critical_fail_rate",
            gate_type="soft",
            passed=fail_rate <= 0.05,
            message=f"Fail rate: {fail_rate:.1%} ({len(failures)}/{len(non_critical)})",
        ))
        return gates

    def should_block_merge(self, gates: List[GateResult]) -> bool:
        return any(g.gate_type == "hard" and not g.passed for g in gates)

    def needs_review(self, gates: List[GateResult]) -> bool:
        return any(g.gate_type == "soft" and not g.passed for g in gates)
```
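The two predicates can be folded into a single decision that a CI step or bot can act on. This is a minimal, self-contained sketch: it redeclares a `GateResult` with the same fields, and the `merge_decision` name and its three return labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateResult:
    name: str
    gate_type: str  # "hard" | "soft"
    passed: bool
    message: str


def merge_decision(gates: List[GateResult]) -> str:
    """Hard failures take precedence over soft ones."""
    if any(g.gate_type == "hard" and not g.passed for g in gates):
        return "block"
    if any(g.gate_type == "soft" and not g.passed for g in gates):
        return "review"
    return "merge"


def failure_messages(gates: List[GateResult]) -> List[str]:
    """Collect messages from failed gates for a PR comment or log."""
    return [f"[{g.gate_type}] {g.name}: {g.message}" for g in gates if not g.passed]
```

Checking hard gates first means a pull request with both a hard and a soft failure is reported as blocked rather than merely flagged for review.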
Prompt testing transforms LLM applications from "works on my machine" experiments into production-grade systems with measurable, repeatable quality guarantees. Start with a handful of golden examples and deterministic assertions, then progressively layer in LLM-judge evaluations and regression baselines as your system matures.