Prompt Testing: Automated Evaluation


Prompts are code — brittle, version-sensitive, and user-facing. A single word change can flip a model from producing reliable JSON to hallucinating wildly. Yet most teams still evaluate prompts by "vibes": paste a few examples into a playground, eyeball the outputs, and ship. This works until it doesn't — a model update silently degrades your summarizer, a new edge case triggers toxic output, or a refactored system prompt breaks a downstream parser.

This post introduces a disciplined, automated approach to prompt testing: building deterministic test suites, curating golden datasets, detecting regressions across model versions, using LLMs to judge other LLMs, and wiring everything into CI/CD so every prompt change is gated by evidence.

Why Test Prompts

Traditional software testing asserts deterministic behavior: function f(x) always returns y. LLM outputs are stochastic, context-dependent, and sensitive to seemingly trivial changes. This makes testing harder — but also more important. Three forces make prompt testing non-negotiable in production:

Prompt Fragility

Research from Microsoft and others shows that reordering few-shot examples, changing a single delimiter, or swapping "Answer:" for "Response:" can shift accuracy by 5–20 percentage points on benchmarks. Prompts are the highest-leverage surface in an LLM application, and they are also the most fragile.

Model Update Risk

When a provider pushes a new model snapshot (e.g., gpt-4-0613 → gpt-4-1106-preview), your prompts run against a different function. Behavioral changes are undocumented, often subtle, and may not surface for weeks. Without automated tests, you are flying blind through every model update.

Regression at Scale

A team with 50+ prompts across agents, RAG pipelines, and classifiers cannot manually QA every change. Regression testing must be automated, fast, and integrated into the development workflow — just like unit tests for application code.

Anti-pattern: Relying on a single "golden prompt" with no test coverage. When it breaks after a model update, the team spends days debugging because there's no baseline to compare against.

Building Prompt Test Suites

A prompt test suite is a structured collection of input → expected behavior pairs, with assertions that verify the model output meets criteria. Unlike unit tests, prompt tests rarely assert exact string equality. Instead, they check structural, semantic, and constraint-based properties.

Test Case Anatomy

Every test case should capture four elements:

Input — The user message, context documents, or variables injected into the prompt template.
Expected behavior — What the output must satisfy (format, content, constraints).
Assertion type — How to evaluate (exact match, contains, regex, semantic similarity, LLM-judge).
Metadata — Tags for category, priority, edge-case type, creation date.

# prompt_test_suite.py — Structured test case definition

from dataclasses import dataclass, field
from typing import List
import json
import re

# Assertion is defined first so the List[Assertion] annotation below resolves
# when PromptTestCase is created.
@dataclass
class Assertion:
    type: str          # "contains" | "regex" | "json_schema" | "semantic" | "llm_judge"
    expected: str
    threshold: float = 0.85  # for similarity-based assertions

@dataclass
class PromptTestCase:
    name: str
    prompt_template: str
    variables: dict
    assertions: List[Assertion]
    tags: List[str] = field(default_factory=list)
    priority: str = "medium"  # low | medium | high | critical

# Example test case for a summarization prompt.
# EARNINGS_CALL_TEXT and GOLDEN_SUMMARY are placeholders for fixtures defined elsewhere.
test_summarize = PromptTestCase(
    name="summarize_earnings_call",
    prompt_template="Summarize the following transcript in exactly 3 bullet points:\n\n{transcript}",
    variables={"transcript": EARNINGS_CALL_TEXT},
    assertions=[
        Assertion(type="regex",    expected=r"^[-•]\s.+(\n[-•]\s.+){2}$"),  # exactly 3 bullets
        Assertion(type="contains", expected="revenue"),                      # must mention revenue
        Assertion(type="semantic", expected=GOLDEN_SUMMARY, threshold=0.80),  # cosine sim
    ],
    tags=["summarization", "finance"],
    priority="critical"
)

Assertion Types

Deterministic Assertions

exact_match — Output equals expected string
contains — Output includes a substring
not_contains — Output excludes banned phrases
regex — Output matches a pattern
json_schema — Output parses and validates against schema
length — Token or character count within bounds
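The deterministic checks listed above can each be written as a small pure function. The following is a minimal sketch, not the post's implementation: the function names are hypothetical, and `json_has_fields` is a simplified stand-in for full JSON Schema validation that only checks top-level field names and types.

```python
import json
import re
from typing import Dict, List


def exact_match(output: str, expected: str) -> bool:
    # Compare after trimming surrounding whitespace, which models often add.
    return output.strip() == expected.strip()


def not_contains(output: str, banned: List[str]) -> bool:
    # Case-insensitive check that no banned phrase appears in the output.
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned)


def matches_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None


def length_within(output: str, min_chars: int, max_chars: int) -> bool:
    # Character-based bounds; a token-based check would need a tokenizer.
    return min_chars <= len(output) <= max_chars


def json_has_fields(output: str, required: Dict[str, type]) -> bool:
    # Simplified stand-in for schema validation (assumption): the output must
    # parse as a JSON object and each required key must have the given type.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Because these checks need no model call, they are cheap enough to run on every output before any probabilistic assertion is evaluated.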

Probabilistic Assertions

semantic_similarity — Cosine similarity ≥ threshold
llm_judge — A judge model scores on rubric
classification — Output classified into expected label
entailment — NLI model confirms output entails reference
toxicity — Safety classifier score below threshold
coherence — Fluency score above a threshold, or perplexity below one (lower perplexity indicates more fluent text)
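As an illustration of the semantic_similarity check, here is a minimal sketch of a cosine-similarity threshold test. It assumes the output and the reference have already been embedded into equal-length vectors by whatever embedding model you use; the function names and the default threshold of 0.85 (taken from the `Assertion` default) are illustrative.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_pass(output_vec: Sequence[float], reference_vec: Sequence[float],
                  threshold: float = 0.85) -> Tuple[bool, float]:
    # Return both the verdict and the raw score so failures can be inspected.
    score = cosine_similarity(output_vec, reference_vec)
    return score >= threshold, score
```

Returning the raw score alongside the verdict makes it easier to tune the threshold against examples a human has already labeled as acceptable or not.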

# Test runner — executing and evaluating a suite

class PromptTestRunner:
    def __init__(self, llm_client, embedding_model, judge_model=None):
        self.llm = llm_client
        self.embedder = embedding_model
        self.judge = judge_model

    def run_suite(self, suite: List[PromptTestCase]) -> TestReport:
        results = []
        for tc in suite:
            prompt = tc.prompt_template.format(**tc.variables)
            output = self.llm.generate(prompt, temperature=0)

            passed = True
            details = []
            for a in tc.assertions:
                result = self._evaluate(output, a)
                details.append(result)
                if not result.passed:
                    passed = False

            results.append(TestResult(
                name=tc.name, passed=passed,
                output=output, assertion_details=details,
                priority=tc.priority  # carried over so quality gates can filter on it
            ))
        return TestReport(results=results)

    def _evaluate(self, output: str, assertion: Assertion) -> AssertionResult:
        if assertion.type == "contains":
            return AssertionResult(passed=assertion.expected in output)
        elif assertion.type == "regex":
            return AssertionResult(passed=bool(re.search(assertion.expected, output)))
        elif assertion.type == "json_schema":
            return self._validate_json(output, assertion.expected)
        elif assertion.type == "semantic":
            score = self._cosine_sim(output, assertion.expected)
            return AssertionResult(passed=score >= assertion.threshold, score=score)
        elif assertion.type == "llm_judge":
            return self._judge_eval(output, assertion)
        raise ValueError(f"Unknown assertion type: {assertion.type}")
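The runner above refers to `AssertionResult`, `TestResult`, and `TestReport` without defining them. One possible shape is sketched below. The field names are inferred from how the runner constructs these objects; the `message` field, the `priority` field on `TestResult`, and the `pass_rate` and `failures` helpers are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssertionResult:
    passed: bool
    score: Optional[float] = None   # set by similarity or judge assertions
    message: str = ""               # assumption: free-text detail for debugging


@dataclass
class TestResult:
    name: str
    passed: bool
    output: str                     # raw model output, kept for debugging
    assertion_details: List[AssertionResult] = field(default_factory=list)
    priority: str = "medium"        # assumption: mirrors PromptTestCase.priority


@dataclass
class TestReport:
    results: List[TestResult]

    @property
    def pass_rate(self) -> float:
        # Fraction of test cases whose assertions all passed.
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)

    def failures(self) -> List[TestResult]:
        return [r for r in self.results if not r.passed]
```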

Tip: Run prompt tests at temperature=0 (and with a fixed seed when available) to maximize reproducibility. Even then, identical outputs across runs are not guaranteed, so save the raw output alongside pass/fail for debugging.
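One lightweight way to keep raw outputs is to serialize each run as a JSON Lines entry. The sketch below is illustrative only: `run_record` and its field names are hypothetical, and the short prompt hash is simply a convenient way to tell which template version produced a given output.

```python
import hashlib
import json


def run_record(test_name: str, prompt: str, output: str, passed: bool) -> str:
    """Serialize one test run as a single JSON Lines entry."""
    entry = {
        "test": test_name,
        # Short hash of the rendered prompt, to identify the prompt version.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12],
        "output": output,
        "passed": passed,
    }
    # ensure_ascii=False keeps bullets and other non-ASCII output readable.
    return json.dumps(entry, ensure_ascii=False)
```

Appending one such line per test case to a log file gives you a diffable history of what the model actually returned on each run.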

Golden Datasets

A golden dataset is a curated, versioned collection of input-output pairs that represent the expected behavior of your prompt across the full range of production scenarios. It serves as the ground truth for all automated evaluations.

Curation Principles

Representative coverage — Include common cases, edge cases, adversarial inputs, and multilingual examples proportional to real traffic.
Human-verified — Every golden output should be reviewed and approved by a domain expert, not just copied from model output.
Versioned alongside prompts — Golden datasets live in version control next to the prompt templates they evaluate.
Living documents — Update the golden set when you discover new failure modes in production.

# golden_dataset.py — Schema and loading

import json
from pathlib import Path
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class GoldenExample:
    id: str
    input_text: str
    expected_output: str
    tags: List[str]
    category: str           # "common" | "edge_case" | "adversarial"
    verified_by: str        # human reviewer name
    created_at: str

class GoldenDataset:
    def __init__(self, path: str):
        self.path = Path(path)
        self.examples = self._load()
        self.version = self._compute_hash()

    def _load(self) -> List[GoldenExample]:
        data = json.loads(self.path.read_text())
        return [GoldenExample(**ex) for ex in data["examples"]]

    def coverage_report(self) -> Dict[str, int]:
        """Count examples per category for coverage auditing."""
        counts = {}
        for ex in self.examples:
            counts[ex.category] = counts.get(ex.category, 0) + 1
        return counts

    def filter_by_tag(self, tag: str) -> List[GoldenExample]:
        return [ex for ex in self.examples if tag in ex.tags]

    def _compute_hash(self) -> str:
        import hashlib
        content = self.path.read_bytes()
        return hashlib.sha256(content).hexdigest()[:12]

Coverage Metrics

Rule of thumb: A production-quality golden dataset should have at least 50 examples for a classification prompt and 100+ for generation tasks. Allocate ≥20% to edge cases and adversarial inputs.
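The rule of thumb can be turned into an automated audit. This is a minimal sketch, assuming category counts in the shape returned by `coverage_report()` and the category names used in `GoldenExample`; `check_coverage` and its defaults are hypothetical.

```python
from typing import Dict, List


def check_coverage(counts: Dict[str, int], min_total: int = 100,
                   min_hard_share: float = 0.20) -> List[str]:
    """Return a list of human-readable coverage problems (empty if none)."""
    problems = []
    total = sum(counts.values())
    # "Hard" examples are edge cases plus adversarial inputs.
    hard = counts.get("edge_case", 0) + counts.get("adversarial", 0)
    if total < min_total:
        problems.append(f"only {total} examples, need at least {min_total}")
    if total > 0 and hard / total < min_hard_share:
        problems.append(
            f"edge/adversarial share is {hard / total:.0%}, "
            f"need at least {min_hard_share:.0%}"
        )
    return problems
```

For a classification prompt you would call it with `min_total=50`, matching the lower bound suggested above.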

Regression Testing

Regression testing detects when prompt changes or model updates degrade performance relative to a known baseline. The key challenge is that LLM outputs are stochastic — you cannot simply diff strings. Instead, you must compare aggregate metrics and establish statistical significance.

Baseline → Candidate Comparison

The regression testing workflow compares a baseline (last known-good prompt + model) against a candidate (the proposed change). For each golden example, run both and compare:

# regression_tester.py — Compare baseline vs candidate

import numpy as np
from scipy import stats

class RegressionTester:
    def __init__(self, runner: PromptTestRunner, golden: GoldenDataset):
        self.runner = runner
        self.golden = golden

    def compare(self, baseline_prompt: str, candidate_prompt: str,
                model: str, threshold: float = 0.02) -> RegressionResult:
        """Run both prompts on golden data and compare metrics."""

        baseline_scores = []
        candidate_scores = []

        for ex in self.golden.examples:
            # Both prompts are expected to contain an {input} placeholder.
            # The model name is passed through so both runs hit the same snapshot.
            b_out = self.runner.llm.generate(
                baseline_prompt.format(input=ex.input_text),
                model=model, temperature=0
            )
            c_out = self.runner.llm.generate(
                candidate_prompt.format(input=ex.input_text),
                model=model, temperature=0
            )
            b_score = self._score(b_out, ex.expected_output)
            c_score = self._score(c_out, ex.expected_output)
            baseline_scores.append(b_score)
            candidate_scores.append(c_score)

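        # Paired t-test (two-sided): both prompts are scored on the same examples
        t_stat, p_value = stats.ttest_rel(candidate_scores, baseline_scores)
        mean_diff = np.mean(candidate_scores) - np.mean(baseline_scores)

        # Fail only when the mean drop exceeds the threshold AND is statistically
        # significant. p_value can be NaN when every paired difference is zero;
        # in that case the first condition already passes.
        passed = mean_diff >= -threshold or p_value > 0.05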
        return RegressionResult(
            passed=passed,
            baseline_mean=np.mean(baseline_scores),
            candidate_mean=np.mean(candidate_scores),
            p_value=p_value,
            degraded_examples=self._find_degraded(baseline_scores, candidate_scores)
        )

    def _find_degraded(self, b_scores, c_scores, drop_threshold=0.15):
        """Flag individual examples where candidate is significantly worse."""
        degraded = []
        for i, (b, c) in enumerate(zip(b_scores, c_scores)):
            if b - c > drop_threshold:
                degraded.append({
                    "index": i,
                    "baseline_score": b,
                    "candidate_score": c,
                    "drop": round(b - c, 4)
                })
        return degraded
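The `_score` method used in `compare` is not shown in the listing. As a stand-in, here is a minimal sketch of a token-level F1 scorer that returns a value in [0, 1]. This is an assumption for illustration, not the post's scoring method; an embedding similarity or judge score could be swapped in behind the same signature.

```python
import re
from collections import Counter


def token_f1(output: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a golden reference."""
    out_tokens = re.findall(r"\w+", output.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not out_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty side is a miss.
        return float(out_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token overlap is crude for free-form generation, but it is deterministic and cheap, which makes it a reasonable first baseline metric.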

Pitfall: A single aggregate metric can mask localized regressions. Always inspect the per-example breakdown — a prompt may improve average accuracy while completely failing on a critical edge-case category.
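To make that breakdown concrete, the sketch below groups per-example score changes by category. It assumes three parallel sequences: the category of each golden example and its baseline and candidate scores. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Sequence


def delta_by_category(categories: Sequence[str], baseline: Sequence[float],
                      candidate: Sequence[float]) -> Dict[str, float]:
    """Mean score change (candidate minus baseline) for each category."""
    grouped = defaultdict(list)
    for cat, b, c in zip(categories, baseline, candidate):
        grouped[cat].append(c - b)
    return {cat: round(mean(deltas), 4) for cat, deltas in grouped.items()}


categories = ["common", "common", "common", "common", "adversarial"]
baseline = [0.7, 0.7, 0.7, 0.7, 0.9]
candidate = [0.9, 0.9, 0.9, 0.9, 0.4]
report = delta_by_category(categories, baseline, candidate)
```

In this made-up example the overall mean rises from 0.74 to 0.80, yet the adversarial category drops by 0.5, which is exactly the kind of localized regression an aggregate metric hides.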

LLM-as-Judge Evaluation

Many prompt outputs — creative text, explanations, conversational responses — cannot be evaluated with deterministic assertions or embedding similarity alone. LLM-as-judge uses a separate (often stronger) model to score candidate outputs against a rubric, mimicking human evaluation at scale.

Single-Point vs. Pairwise Evaluation

Single-Point Grading

The judge scores one output on an absolute rubric (e.g., 1–5 for helpfulness). Simple to implement but sensitive to score calibration: the same output can receive different absolute scores from different judge models or rubric wordings.

Use when: You have one candidate output per input and need an absolute quality metric.

Pairwise Comparison

The judge compares two outputs (A vs. B) and picks a winner. More robust to calibration drift but exposed to position bias, and it requires 2× inference cost because both variants must be generated.

Use when: Comparing a baseline prompt against a candidate, or ranking multiple prompt variants.
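A common way to reduce position effects in pairwise comparison is to ask the judge twice, once in each order, and count a win only when both verdicts agree. The sketch below assumes a hypothetical `judge(input_text, first, second)` callable that returns "A" if the first output is better and "B" otherwise; it doubles the number of judge calls.

```python
from typing import Callable


def consistent_pairwise(judge: Callable[[str, str, str], str],
                        input_text: str, baseline: str, candidate: str) -> str:
    """Return "baseline", "candidate", or "tie" after judging both orders."""
    first = judge(input_text, baseline, candidate)    # baseline shown as A
    second = judge(input_text, candidate, baseline)   # candidate shown as A
    if first == "A" and second == "B":
        return "baseline"
    if first == "B" and second == "A":
        return "candidate"
    # The verdict flipped with the order, so it reflects position, not quality.
    return "tie"


def always_first(input_text: str, a: str, b: str) -> str:
    # A fully position-biased judge, for demonstration.
    return "A"


def prefers_longer(input_text: str, a: str, b: str) -> str:
    # A toy judge that picks the longer output regardless of position.
    return "A" if len(a) >= len(b) else "B"
```

With `always_first` the result is always a tie, which shows how the two-order scheme neutralizes a judge that simply favors whichever output comes first.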

# llm_judge.py — LLM-as-Judge evaluator

import json
from typing import List

import numpy as np

JUDGE_RUBRIC_TEMPLATE = """You are an expert evaluator. Score the following output
on a scale of 1-5 for each criterion.

[Input]: {input_text}
[Output]: {model_output}
[Reference]: {reference_output}

Criteria:
1. **Accuracy** — Are all facts correct and consistent with the reference?
2. **Completeness** — Does the output cover all key points?
3. **Conciseness** — Is the output free of unnecessary repetition?
4. **Format compliance** — Does the output follow the requested format?

Respond in JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "format": N}}
"""

PAIRWISE_TEMPLATE = """Compare these two outputs for the given input.

[Input]: {input_text}
[Output A]: {output_a}
[Output B]: {output_b}

Which output is better overall? Respond with exactly "A" or "B" and a one-sentence reason.
"""

class LLMJudge:
    def __init__(self, judge_client, model="gpt-4o"):
        self.client = judge_client
        self.model = model

    def grade_single(self, input_text, output, reference) -> dict:
        prompt = JUDGE_RUBRIC_TEMPLATE.format(
            input_text=input_text, model_output=output,
            reference_output=reference
        )
        response = self.client.generate(prompt, model=self.model, temperature=0)
        return json.loads(response)

    def pairwise_compare(self, input_text, output_a, output_b) -> str:
        # Randomize order to mitigate position bias
        import random
        if random.random() > 0.5:
            output_a, output_b = output_b, output_a
            swapped = True
        else:
            swapped = False

        prompt = PAIRWISE_TEMPLATE.format(
            input_text=input_text, output_a=output_a, output_b=output_b
        )
        verdict = self.client.generate(prompt, model=self.model, temperature=0)
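        # Read the verdict from the first non-space character. Searching a prefix
        # for "A" misreads replies such as "B. A is less complete".
        winner = "A" if verdict.strip().upper().startswith("A") else "B"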

        # Correct for swap
        if swapped:
            winner = "B" if winner == "A" else "A"
        return winner

    def evaluate_suite(self, golden: GoldenDataset, outputs: List[str]) -> dict:
        scores = {"accuracy": [], "completeness": [], "conciseness": [], "format": []}
        for ex, out in zip(golden.examples, outputs):
            grade = self.grade_single(ex.input_text, out, ex.expected_output)
            for k in scores:
                scores[k].append(grade[k])
        return {k: round(np.mean(v), 2) for k, v in scores.items()}
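In practice a judge may wrap its JSON in extra prose or return out-of-range scores, in which case a bare `json.loads` on the response fails or lets bad values through. A more defensive parser might look like the following sketch. The function name is hypothetical, and the criteria keys are taken from the rubric template above.

```python
import json
import re

CRITERIA = ("accuracy", "completeness", "conciseness", "format")


def parse_judge_scores(response: str) -> dict:
    """Extract and validate rubric scores from a judge response."""
    # Take the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    scores = {}
    for key in CRITERIA:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {key!r}: {value!r}")
        scores[key] = value
    return scores
```

Treating a malformed judge response as an error, rather than as a low score, keeps judge failures from being mistaken for prompt regressions.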

Important: LLM judges exhibit well-documented biases — position bias (favoring the first option), verbosity bias (preferring longer outputs), and self-preference (favoring outputs from the same model family). Always randomize presentation order and calibrate with human agreement studies.
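One way to quantify agreement between the judge and human reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes both sides have labeled the same items with categorical verdicts (for example "A"/"B" winners, or pass/fail); using kappa here is a suggestion, not something the post prescribes.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Probability that both raters pick the same label independently.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 means the judge tracks human verdicts closely, while a value near 0 means its agreement is no better than chance and its scores should not gate merges.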

CI/CD Integration

The final step is wiring prompt tests into your CI/CD pipeline so every prompt change — whether a template edit, model version bump, or golden dataset update — is automatically evaluated before reaching production. This turns prompt engineering from an ad-hoc craft into a software engineering discipline.

GitHub Actions Workflow

# .github/workflows/prompt-tests.yml

name: Prompt Regression Tests

on:
  pull_request:
    paths:
      - "prompts/**"
      - "golden_datasets/**"
      - "prompt_tests/**"
  schedule:
    - cron: "0 6 * * 1"  # Weekly model drift check

jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the summary comment
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements-prompt-tests.txt

      - name: Run prompt test suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m prompt_tests.runner \
            --suite prompt_tests/suites/ \
            --golden golden_datasets/ \
            --output results/report.json \
            --fail-on-regression

      - name: Upload test report
        if: always()  # keep the report even when the test step fails
        uses: actions/upload-artifact@v4
        with:
          name: prompt-test-report
          path: results/report.json

      - name: Post summary to PR
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('results/report.json'));
            const summary = [
              `## 🧪 Prompt Test Results`,
              `| Metric | Baseline | Candidate | Δ |`,
              `|--------|----------|-----------|---|`,
              ...report.metrics.map(m =>
                `| ${m.name} | ${m.baseline} | ${m.candidate} | ${m.delta} |`
              ),
              `\n**Verdict:** ${report.passed ? '✅ Pass' : '❌ Fail'}`
            ].join('\n');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: summary
            });
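The workflow invokes `python -m prompt_tests.runner`, which the post does not show. Below is a minimal sketch of what such an entry point's argument handling and exit-code logic could look like. Everything here is hypothetical: the flag names mirror the workflow, and the report shape (`passed` plus a `metrics` list) mirrors what the PR summary script reads.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="prompt_tests.runner")
    parser.add_argument("--suite", required=True, help="directory of test suites")
    parser.add_argument("--golden", required=True, help="directory of golden datasets")
    parser.add_argument("--output", required=True, help="path for the JSON report")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="exit non-zero when the report did not pass")
    return parser


def write_report(report: dict, output: str) -> None:
    path = Path(output)
    # Create results/ (or any parent directory) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")


def exit_code(report: dict, fail_on_regression: bool) -> int:
    # A non-zero exit code is what makes the CI step, and thus the PR check, fail.
    return 1 if fail_on_regression and not report["passed"] else 0
```

Writing the report before computing the exit code matters: the report should exist on disk even when the run is about to fail the build.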

Quality Gates

Define automated gates that block merges when prompts regress:

Hard Gates (Block Merge)

Any critical test case fails
Aggregate accuracy drops by more than 2 percentage points (0.02)
JSON schema validation failures > 0
Toxicity score exceeds threshold

Soft Gates (Require Review)

Non-critical test failures > 5%
LLM judge scores decline on any axis
New untested prompt templates detected
Golden dataset coverage below target

# gate_evaluator.py — Automated quality gates

from dataclasses import dataclass
from typing import List

@dataclass
class GateResult:
    name: str
    gate_type: str   # "hard" | "soft"
    passed: bool
    message: str

class QualityGateEvaluator:
    def __init__(self, config: dict):
        self.config = config

    def evaluate(self, report: TestReport, regression: RegressionResult) -> List[GateResult]:
        gates = []

        # Hard gate: critical test failures
        critical_failures = [r for r in report.results
                             if r.priority == "critical" and not r.passed]
        gates.append(GateResult(
            name="critical_tests", gate_type="hard",
            passed=len(critical_failures) == 0,
            message=f"{len(critical_failures)} critical test(s) failed"
        ))

        # Hard gate: accuracy regression
        max_drop = self.config.get("max_accuracy_drop", 0.02)
        accuracy_drop = regression.baseline_mean - regression.candidate_mean
        gates.append(GateResult(
            name="accuracy_regression", gate_type="hard",
            passed=accuracy_drop <= max_drop,
            message=f"Accuracy drop: {accuracy_drop:.3f} (max: {max_drop})"
        ))

        # Soft gate: non-critical failure rate (critical tests have their own hard gate)
        non_critical = [r for r in report.results if r.priority != "critical"]
        total = len(non_critical)
        failures = len([r for r in non_critical if not r.passed])
        fail_rate = failures / total if total > 0 else 0
        gates.append(GateResult(
            name="non_critical_fail_rate", gate_type="soft",
            passed=fail_rate <= 0.05,
            message=f"Fail rate: {fail_rate:.1%} ({failures}/{total})"
        ))

        return gates

    def should_block_merge(self, gates: List[GateResult]) -> bool:
        return any(g.gate_type == "hard" and not g.passed for g in gates)

    def needs_review(self, gates: List[GateResult]) -> bool:
        return any(g.gate_type == "soft" and not g.passed for g in gates)

Pro tip: Run a weekly scheduled pipeline (even without code changes) to catch model drift. Providers can change the model behind an unpinned alias without notice, so your Monday suite may fail even though no one touched a line of code.

Prompt testing transforms LLM applications from "works on my machine" experiments into production-grade systems with measurable, repeatable quality guarantees. Start with a handful of golden examples and deterministic assertions, then progressively layer in LLM-judge evaluations and regression baselines as your system matures.