Eval Frameworks: lm-eval-harness, OpenAI Evals

MLOps Series LLM Evaluation & Safety

The Eval Framework Landscape

Evaluating large language models requires more than running a handful of prompts manually. As models grow in capability, the community has converged on standardized evaluation frameworks that provide reproducible benchmarks, extensible task definitions, and structured reporting. Choosing the right framework depends on whether you need academic benchmarks, production-grade CI evals, or custom domain-specific tests.

Key insight: No single framework covers every use case. Most mature LLM teams combine an academic harness (for baseline comparisons) with a production eval framework (for regression testing in CI/CD).

The two most widely adopted open-source frameworks are EleutherAI's lm-eval-harness (the engine behind the Open LLM Leaderboard) and OpenAI Evals (designed for evaluating chat/completion models). We'll cover each in depth, then survey the broader ecosystem.

lm-eval-harness: Setup & Usage

The lm-evaluation-harness by EleutherAI is the de facto standard for reproducible LLM benchmarking. It ships with 200+ built-in tasks spanning reasoning (ARC, HellaSwag), knowledge (MMLU, TriviaQA), code (HumanEval), and more.

Installation

# Install from PyPI (stable)
pip install lm-eval

# Or install from source for latest tasks
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Running Standard Benchmarks

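The CLI is the fastest way to evaluate a model against standard tasks. Below we run MMLU and HellaSwag on a Hugging Face model. The --num_fewshot flag applies to every task in the invocation, so both tasks run 5-shot here; to run HellaSwag 10-shot, use a separate invocation with --num_fewshot 10: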

# Evaluate a Hugging Face model on MMLU and HellaSwag
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,hellaswag \
  --num_fewshot 5 \
  --batch_size 8 \
  --device cuda \
  --output_path ./results/llama2-7b/

Tip: Use --limit 100 during development to quickly test your setup on a subset of examples before running full evaluations that can take hours.
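Since one invocation carries a single --num_fewshot value, per-task shot counts need one run per task. A minimal sketch of building those commands follows; it only reuses the CLI flags shown above, while the helper name build_lm_eval_commands and the per-task output layout are illustrative assumptions.

```python
def build_lm_eval_commands(pretrained, shots_per_task, output_root, limit=None):
    """Return one lm_eval argument list per task, each with its own shot count."""
    commands = []
    for task, shots in shots_per_task.items():
        cmd = [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={pretrained}",
            "--tasks", task,
            "--num_fewshot", str(shots),
            # Hypothetical layout: one output directory per task
            "--output_path", f"{output_root}/{task}",
        ]
        if limit is not None:
            # Subset run for quick setup checks, as in the tip above
            cmd += ["--limit", str(limit)]
        commands.append(cmd)
    return commands


commands = build_lm_eval_commands(
    "meta-llama/Llama-2-7b-hf",
    {"mmlu": 5, "hellaswag": 10},
    "./results/llama2-7b",
    limit=100,
)
for cmd in commands:
    # Printing only; pass each list to subprocess.run(cmd, check=True) to execute
    print(" ".join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems when model names or paths contain special characters.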

Evaluating API-Based Models

lm-eval-harness supports API endpoints via the local-completions or openai-completions model type:

# Evaluate an OpenAI-compatible API endpoint
# (base_url should be the full completions route, not just the server root)
lm_eval \
  --model local-completions \
  --model_args model=my-model,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface \
  --tasks arc_easy,winogrande \
  --num_fewshot 0 \
  --batch_size 1

Python API

For programmatic control — e.g., running evals inside a training loop or CI pipeline — use the Python interface:

import lm_eval

# Run evaluation programmatically
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2",
    tasks=["mmlu", "truthfulqa_mc2"],
    num_fewshot=5,
    batch_size=16,
    device="cuda",
)

# Extract scores
for task_name, task_results in results["results"].items():
    acc = task_results.get("acc,none", "N/A")
    print(f"{task_name}: accuracy = {acc}")

Understanding Results Output

lm-eval-harness generates a JSON results file with detailed per-task metrics. Key fields include:

acc,none — raw accuracy (no normalization)
acc_norm,none — length-normalized accuracy (important for multiple-choice)
acc_stderr,none — standard error for confidence intervals
alias — human-readable task name
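As an illustration of how the standard-error field feeds a confidence interval, the sketch below applies a normal approximation (value ± 1.96 × stderr) to one task's metrics. The example dictionary holds made-up numbers, and the helper name confidence_interval is hypothetical.

```python
def confidence_interval(task_results, metric="acc", z=1.96):
    """Normal-approximation interval from a metric and its standard error."""
    value = task_results[f"{metric},none"]
    stderr = task_results[f"{metric}_stderr,none"]
    # Clamp to [0, 1] because accuracy cannot leave that range
    return max(0.0, value - z * stderr), min(1.0, value + z * stderr)


# Made-up numbers shaped like one entry of results["results"]
example = {"acc,none": 0.62, "acc_stderr,none": 0.01, "alias": "demo_task"}

low, high = confidence_interval(example)
print(f"{example['alias']}: {example['acc,none']:.3f} (95% CI {low:.3f} to {high:.3f})")
```

When two models' intervals overlap heavily, the difference between their scores may not be meaningful, which matters most on small subsets such as --limit runs.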

OpenAI Evals: Setup & Usage

OpenAI Evals is a framework for evaluating LLMs (not limited to OpenAI models) using YAML-defined evaluation specifications. It excels at testing chat-based models with custom prompts, expected outputs, and grading criteria.

Installation

# Install OpenAI Evals
pip install evals

# Or from source
git clone https://github.com/openai/evals.git
cd evals
pip install -e .

YAML Eval Specification Format

Each eval is defined by a YAML registry entry that specifies the eval class, the dataset, and grading criteria:

# evals/registry/evals/my-qa-eval.yaml
my-qa-eval:
  id: my-qa-eval.v1
  description: "Custom QA evaluation for domain knowledge"
  metrics: [accuracy]

my-qa-eval.v1:
  class: evals.elsuite.basic.match:Match
  args:
    # Resolved relative to evals/registry/data/
    samples_jsonl: my-qa-eval/samples.jsonl

Samples JSONL Format

The dataset is a JSONL file where each line defines an input/expected-output pair:

samples.jsonl (one JSON object per line; the file itself cannot contain comment lines):

{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "user", "content": "Explain gradient descent in one sentence."}], "ideal": "Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest decrease of a loss function."}
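A malformed line in the samples file usually surfaces only partway through a run, so it can help to check the file first. The sketch below validates the shape shown above (an "input" list of role/content messages plus an "ideal" answer) and writes samples one object per line. Both helper names are hypothetical and not part of OpenAI Evals.

```python
import json


def validate_sample(line):
    """Parse one JSONL line and check it has the input/ideal shape."""
    sample = json.loads(line)
    for key in ("input", "ideal"):
        if key not in sample:
            raise ValueError(f"sample is missing '{key}'")
    messages = sample["input"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'input' must be a non-empty list of chat messages")
    for message in messages:
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError(f"message needs 'role' and 'content': {message!r}")
    return sample


def write_samples(path, samples):
    """Write samples as JSONL: exactly one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


line = json.dumps({
    "input": [{"role": "user", "content": "What is the capital of France?"}],
    "ideal": "Paris",
})
print(validate_sample(line)["ideal"])
```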

Running Evals

# Run the eval against GPT-4
oaieval gpt-4 my-qa-eval

# Run against a local model via completion function
oaieval my-local-model my-qa-eval --record_path ./results/run1.jsonl

Eval Classes

OpenAI Evals ships with several built-in eval classes for different grading strategies:

Match

Checks whether the model output starts with the ideal answer. Fast, deterministic. Best for factual QA with short, unambiguous answers; an output that opens with a preamble before the answer will fail.

FuzzyMatch

Normalizes both strings (case, punctuation, whitespace), then checks whether either one contains the other. Useful when outputs may vary in formatting but not substance.

Includes

Checks whether the ideal answer appears as a substring in the model output. Good for open-ended responses where the key fact must be present.

ModelGraded

Uses another LLM (e.g., GPT-4) to judge the quality of the response. Most flexible but adds cost and latency. Essential for subjective evaluations.

Warning: The ModelGraded eval class incurs additional API costs since it calls an LLM to grade each sample. Budget for roughly 2× token usage when using model-graded evals.
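To make the difference between the three string-based strategies concrete, here is a simplified approximation. It assumes Match accepts an output that begins with the ideal answer, Includes accepts a substring hit, and FuzzyMatch accepts containment in either direction after normalization. The function names and the normalization rules are illustrative stand-ins, not the library's implementation.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def match(output, ideal):
    """Prefix check: the output must begin with the ideal answer."""
    return output.strip().startswith(ideal)


def includes(output, ideal):
    """Substring check: the ideal answer must appear somewhere in the output."""
    return ideal in output


def fuzzy_match(output, ideal):
    """Containment in either direction, after normalization."""
    a, b = normalize(output), normalize(ideal)
    if not a or not b:
        return False
    return a in b or b in a


output = "The capital of France is Paris."
for grader in (match, includes, fuzzy_match):
    print(grader.__name__, grader(output, "Paris"))
```

The same output can pass one strategy and fail another, so the choice of grader is part of the eval's definition and should be reported alongside the score.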

Custom Eval Creation

Both frameworks support extending beyond built-in tasks. This is critical for domain-specific evaluations — e.g., testing a medical Q&A system or a code generation tool on your proprietary test suite.

Custom Task in lm-eval-harness

Tasks are defined as YAML configurations with optional Python processing functions. Here's a complete custom task definition:

# my_custom_task.yaml (placed in lm_eval/tasks/)
task: my_domain_qa
dataset_path: json
dataset_kwargs:
  data_files:
    test: data/my_domain_test.jsonl
# Tell the harness which split holds the evaluation documents
test_split: test
output_type: generate_until
generation_kwargs:
  until: ["\n", "Question:"]
  max_gen_toks: 256
  temperature: 0.0
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  # BLEU is a corpus-level score, so it uses its own aggregation rather than mean
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
metadata:
  version: 1.0

Custom Task with Python Filter

For more complex preprocessing or scoring, attach a Python filter:

# my_custom_filter.py
import re


def extract_answer(text, doc):
    """Extract the final answer from chain-of-thought output."""
    match = re.search(r"(?:answer is|final answer:?)\s*(.+)", text, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # No marker found: fall back to the whole output
    return text.strip()


def custom_metric(predictions, references):
    """Compute case- and whitespace-insensitive accuracy."""
    if not predictions:
        return 0.0  # avoid dividing by zero on an empty batch
    correct = 0
    for pred, ref in zip(predictions, references):
        if pred.lower().strip() == ref.lower().strip():
            correct += 1
    return correct / len(predictions)

Custom Eval Class in OpenAI Evals

For advanced grading logic beyond the built-in classes, subclass Eval:

# custom_eval.py
import evals
import evals.metrics
from evals.eval import Eval
from evals.record import RecorderBase


class SemanticSimilarityEval(Eval):
    """Grade responses by semantic similarity to reference."""

    def __init__(self, completion_fns, samples_jsonl, threshold=0.85, **kwargs):
        super().__init__(completion_fns, **kwargs)
        # get_samples() reads this attribute, so store the path rather than
        # loading the file by hand
        self.samples_jsonl = samples_jsonl
        self.threshold = threshold

    def _compute_similarity(self, output, ideal):
        # Placeholder: embed both strings with your embedding model and
        # return their cosine similarity
        raise NotImplementedError

    def eval_sample(self, sample, rng):
        prompt = sample["input"]
        ideal = sample["ideal"]
        result = self.completion_fn(prompt)
        output = result.get_completions()[0]

        # Compute cosine similarity (using your embedding model)
        similarity = self._compute_similarity(output, ideal)
        is_correct = similarity >= self.threshold
        evals.record.record_match(
            correct=is_correct,
            expected=ideal,
            picked=output,
            metadata={"similarity": similarity},
        )

    def run(self, recorder: RecorderBase):
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
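The eval above leaves _compute_similarity to the reader. As a rough stand-in for wiring things up before an embedding model is available, the sketch below computes cosine similarity over bag-of-words counts. This swaps in word counts for real embeddings, so it measures lexical overlap rather than semantic similarity, and both function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Token counts over lowercase alphanumeric runs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the cat ran"))
```

Inside the eval class, _compute_similarity could return cosine_similarity(output, ideal) until a real embedding model is plugged in; the 0.85 threshold would likely need retuning for whichever similarity function is used.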

Other Frameworks: Promptfoo, DeepEval & More

The eval ecosystem extends well beyond the two major players. Each framework fills a specific niche:

Promptfoo

Promptfoo is a developer-friendly CLI and library for testing prompts across multiple providers. It shines at prompt engineering workflows and regression testing in CI.

# promptfooconfig.yaml
prompts:
  - "Summarize this text: {{text}}"
  - "Provide a brief summary of: {{text}}"

providers:
  - openai:gpt-4
  - openai:gpt-3.5-turbo
  - ollama:llama2

tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "The summary should be concise and capture the main action."
      - type: similar
        value: "A fox jumps over a dog."
        threshold: 0.8

# Run from CLI
npx promptfoo eval
npx promptfoo view  # Opens web UI with results

DeepEval

DeepEval is a Python testing framework that integrates with pytest and focuses on RAG and LLM application evaluations with built-in metrics:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric


def test_rag_response():
    test_case = LLMTestCase(
        input="What is RLHF?",
        actual_output="RLHF stands for Reinforcement Learning from Human Feedback...",
        retrieval_context=["RLHF is a technique for aligning LLMs..."],
        expected_output="RLHF is Reinforcement Learning from Human Feedback.",
    )
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])

# Run with pytest
pytest test_evals.py -v

# With dashboard integration
deepeval test run test_evals.py

Other Notable Frameworks

HELM (Stanford)

Holistic Evaluation of Language Models. Comprehensive multi-metric framework covering accuracy, calibration, robustness, fairness, efficiency, and bias. Heavy-weight but thorough.

LangSmith Evaluations

Part of the LangChain ecosystem. Deep integration with LangChain traces, custom evaluators, and a hosted dashboard for tracking eval runs over time.

Ragas

Purpose-built for RAG evaluation with metrics like faithfulness, answer relevancy, context precision, and context recall. Lightweight and easy to integrate.

TruLens

Feedback functions for evaluating LLM apps. Focuses on groundedness, relevance, and toxicity. Provides a dashboard for tracking quality over time.

Framework Comparison

Selecting a framework requires weighing several dimensions: benchmark coverage, extensibility, CI integration, cost, and community support.

End-to-End CI Pipeline Example

A practical CI eval pipeline combining lm-eval-harness for baseline benchmarks and Promptfoo for regression tests:

# .github/workflows/llm-eval.yaml
name: LLM Evaluation Pipeline

on:
  pull_request:
    paths: ["prompts/**", "model_configs/**"]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install lm-eval
      - run: |
          lm_eval --model local-completions \
            --model_args model=our-model,base_url=${{ secrets.MODEL_URL }} \
            --tasks mmlu,truthfulqa_mc2 \
            --limit 200 \
            --output_path ./benchmark_results/
      - uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: ./benchmark_results/

  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config evals/regression.yaml
      - run: npx promptfoo eval --config evals/safety.yaml --output results.json
      - run: |
          # Fail PR if accuracy drops below threshold
          python scripts/check_regression.py results.json --threshold 0.95
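The workflow calls scripts/check_regression.py, which is not shown. One possible shape for such a gate is sketched below. It assumes the results file is a JSON list of records with a boolean "success" field; that schema is an assumption for illustration, not Promptfoo's actual output format, so the loading step would need adapting to the real file.

```python
import argparse
import json


def pass_rate(records):
    """Fraction of records whose 'success' field is truthy."""
    if not records:
        raise ValueError("no test records found")
    return sum(1 for record in records if record.get("success")) / len(records)


def gate(records, threshold):
    """Return a process exit code: 0 when the pass rate meets the threshold."""
    rate = pass_rate(records)
    print(f"pass rate {rate:.3f} (threshold {threshold:.3f})")
    return 0 if rate >= threshold else 1


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI when evals regress.")
    parser.add_argument("results_file")
    parser.add_argument("--threshold", type=float, default=0.95)
    args = parser.parse_args(argv)
    with open(args.results_file, encoding="utf-8") as f:
        # ASSUMED schema: a JSON list of {"success": bool, ...} records
        records = json.load(f)
    return gate(records, args.threshold)


# In-memory demo with made-up records: 19 passes and 1 failure
demo = [{"success": True}] * 19 + [{"success": False}]
exit_code = gate(demo, 0.95)
```

To use this as the CI script, end the file with a main guard that calls raise SystemExit(main()), so a non-zero return value fails the workflow step.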

Choosing the Right Framework

The decision tree below helps map your use case to the best-fit framework:

Practical Recommendations

For research teams: Start with lm-eval-harness. It covers every major academic benchmark, is the standard used by the Open LLM Leaderboard, and gives you reproducible comparisons against published results.

For application developers: Use Promptfoo or DeepEval. Both integrate with CI/CD, support custom assertions, and provide web dashboards. DeepEval is the better choice if you're building RAG applications; Promptfoo excels at multi-provider prompt comparison.

Common mistake: Don't rely on a single eval metric. Combine automated metrics (BLEU, exact match) with LLM-as-judge evaluations and human review for a complete picture. Each captures different failure modes.

Building a Multi-Framework Eval Suite

Here's a Python orchestrator that runs multiple frameworks and aggregates results:

import json
import subprocess
from pathlib import Path


def run_eval_suite(model_name: str, output_dir: str):
    """Orchestrate multi-framework evaluation."""
    results = {}
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # 1. Academic benchmarks with lm-eval-harness
    print("[1/3] Running lm-eval-harness benchmarks...")
    subprocess.run([
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", "mmlu,hellaswag,arc_challenge",
        "--num_fewshot", "5",
        "--output_path", str(out / "lm_eval"),
    ], check=True)

    # 2. Prompt regression tests with Promptfoo
    print("[2/3] Running Promptfoo regression tests...")
    subprocess.run([
        "npx", "promptfoo", "eval",
        "--config", "evals/promptfoo_config.yaml",
        "--output", str(out / "promptfoo.json"),
    ], check=True)

    # 3. RAG metrics with DeepEval
    # NOTE: confirm the output flag against your installed deepeval version
    print("[3/3] Running DeepEval RAG tests...")
    subprocess.run([
        "deepeval", "test", "run", "evals/test_rag.py",
        "--output", str(out / "deepeval.json"),
    ], check=True)

    # Aggregate results. lm-eval-harness writes into a subdirectory, so search
    # recursively, and skip a combined report left over from a previous run.
    report_path = out / "combined_report.json"
    for result_file in sorted(out.rglob("*.json")):
        if result_file == report_path:
            continue
        with open(result_file) as f:
            results[str(result_file.relative_to(out))] = json.load(f)

    # Write combined report
    with open(report_path, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Eval suite complete. Results in {out}")
    return results


if __name__ == "__main__":
    run_eval_suite("meta-llama/Llama-2-7b-hf", "./eval_results")

This modular approach lets each framework focus on its strength — academic rigor from lm-eval-harness, prompt-level regression from Promptfoo, and RAG-specific quality from DeepEval — while a single orchestrator produces a unified report for stakeholders.