Evaluating large language models requires more than running a handful of prompts manually. As models grow in capability, the community has converged on standardized evaluation frameworks that provide reproducible benchmarks, extensible task definitions, and structured reporting. Choosing the right framework depends on whether you need academic benchmarks, production-grade CI evals, or custom domain-specific tests.
Key insight: No single framework covers every use case. Most mature LLM teams combine an academic harness (for baseline comparisons) with a production eval framework (for regression testing in CI/CD).
The two most widely adopted open-source frameworks are EleutherAI's lm-eval-harness (the engine behind the Open LLM Leaderboard) and OpenAI Evals (designed for evaluating chat/completion models). We'll cover each in depth, then survey the broader ecosystem.
lm-eval-harness: Setup & Usage
The lm-evaluation-harness by EleutherAI is the de facto standard for reproducible LLM benchmarking. It ships with 200+ built-in tasks spanning reasoning (ARC, HellaSwag), knowledge (MMLU, TriviaQA), code (HumanEval), and more.
Installation
# Install from PyPI (stable)
pip install lm-eval
# Or install from source for latest tasks
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
Running Standard Benchmarks
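The CLI is the fastest way to evaluate a model against standard tasks. Below we run MMLU and HellaSwag on a Hugging Face model. The --num_fewshot flag applies to every task in the run, so both tasks are evaluated 5-shot here; use a separate invocation for HellaSwag if you want it 10-shot: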
# Evaluate a HuggingFace model on MMLU and HellaSwag
lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda \
    --output_path ./results/llama2-7b/
Tip: Use --limit 100 during development to quickly test your setup on a subset of examples before running full evaluations that can take hours.
Evaluating API-Based Models
lm-eval-harness supports API endpoints via the local-completions or openai-completions model type:
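One way to script such a run is to assemble the CLI invocation in Python. The sketch below is illustrative only: the helper name build_lm_eval_command, the model name, and the server URL are assumptions, and the base_url model argument should be checked against the harness version you have installed.

```python
import shlex


def build_lm_eval_command(model_type, model_args, tasks, limit=None):
    """Assemble an lm_eval CLI invocation as an argument list (hypothetical helper)."""
    # The CLI expects model arguments as comma-separated key=value pairs.
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    command = [
        "lm_eval",
        "--model", model_type,
        "--model_args", joined_args,
        "--tasks", ",".join(tasks),
    ]
    if limit is not None:
        # --limit restricts the run to a subset of examples for quick checks.
        command += ["--limit", str(limit)]
    return command


# Assumed setup: an OpenAI-compatible completions server running locally.
command = build_lm_eval_command(
    model_type="local-completions",
    model_args={
        "model": "my-local-model",  # placeholder model name
        "base_url": "http://localhost:8000/v1/completions",  # placeholder URL
    },
    tasks=["hellaswag"],
    limit=100,
)
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would launch the evaluation; printing it first makes it easy to confirm the arguments before a long run.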
Interpreting Results
lm-eval-harness generates a JSON results file with detailed per-task metrics. Key fields include:
acc,none — raw accuracy (no normalization)
acc_norm,none — length-normalized accuracy (important for multiple-choice)
acc_stderr,none — standard error for confidence intervals
alias — human-readable task name
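To make these fields concrete, here is a minimal sketch that reads them and turns the standard error into an approximate 95% confidence interval. It assumes the per-task metrics sit under a top-level "results" key; the embedded payload and its numbers are made up for illustration, not real benchmark scores.

```python
import json

# Illustrative payload only: the numbers are invented, not real results.
EXAMPLE_RESULTS_JSON = """
{
  "results": {
    "hellaswag": {
      "alias": "hellaswag",
      "acc,none": 0.50,
      "acc_stderr,none": 0.01,
      "acc_norm,none": 0.60
    }
  }
}
"""


def summarize(results, metric="acc", z=1.96):
    """Return {task alias: (value, ci_low, ci_high)} for one metric.

    Uses a normal approximation: value +/- z * standard error.
    """
    summary = {}
    for task, fields in results["results"].items():
        value = fields.get(f"{metric},none")
        stderr = fields.get(f"{metric}_stderr,none")
        if not isinstance(value, float) or not isinstance(stderr, float):
            continue  # metric or its standard error is not reported for this task
        name = fields.get("alias", task)
        summary[name] = (value, value - z * stderr, value + z * stderr)
    return summary


report = summarize(json.loads(EXAMPLE_RESULTS_JSON))
for name, (value, low, high) in report.items():
    print(f"{name}: acc={value:.3f} (95% CI {low:.3f} to {high:.3f})")
```

When comparing two models, overlapping intervals are a hint that the difference may be noise rather than a real improvement.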
OpenAI Evals: Setup & Usage
OpenAI Evals is a framework for evaluating LLMs (not limited to OpenAI models) using YAML-defined evaluation specifications. It excels at testing chat-based models with custom prompts, expected outputs, and grading criteria.
Installation
# Install OpenAI Evals
pip install evals
# Or from source
git clone https://github.com/openai/evals.git
cd evals
pip install -e .
YAML Eval Specification Format
Each eval is defined by a YAML registry entry that specifies the eval class, the dataset, and grading criteria:
The dataset is a JSONL file where each line defines an input/expected-output pair:
samples.jsonl (one complete JSON object per line; JSONL has no comment syntax):
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "user", "content": "Explain gradient descent in one sentence."}], "ideal": "Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest decrease of a loss function."}
Running Evals
# Run the eval against GPT-4
oaieval gpt-4 my-qa-eval
# Run against a local model via completion function
oaieval my-local-model my-qa-eval --record_path ./results/run1.jsonl
Eval Classes
OpenAI Evals ships with several built-in eval classes for different grading strategies:
Match
Checks whether the model output starts with the ideal answer. Fast and deterministic. Best for factual QA with short, unambiguous answers.
FuzzyMatch
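Normalized string comparison that tolerates differences in case, punctuation, and whitespace, and passes when either the output or the ideal answer contains the other. Useful when outputs may vary in formatting but not substance.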
Includes
Checks whether the ideal answer appears as a substring in the model output. Good for open-ended responses where the key fact must be present.
ModelGraded
Uses another LLM (e.g., GPT-4) to judge the quality of the response. Most flexible but adds cost and latency. Essential for subjective evaluations.
Warning: The ModelGraded eval class incurs additional API costs since it calls an LLM to grade each sample. Budget for roughly 2× token usage when using model-graded evals.
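The difference between the three string-based strategies is easiest to see side by side. The functions below are a simplified approximation written for this article, not the library's own code. They assume that Match accepts an output beginning with the ideal answer and that FuzzyMatch compares normalized strings by containment.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def match(output, ideal):
    """Strict check: the output must begin with the ideal answer."""
    return output.strip().startswith(ideal)


def includes(output, ideal):
    """Substring check: the ideal answer appears anywhere in the output."""
    return ideal in output


def fuzzy_match(output, ideal):
    """Normalized check: either string contains the other."""
    a, b = normalize(output), normalize(ideal)
    if not a or not b:
        return a == b
    return a in b or b in a


output = "The capital of France is Paris."
for grader in (match, includes, fuzzy_match):
    print(f"{grader.__name__}: {grader(output, 'Paris')}")
```

For this verbose output the strict check fails while the other two pass, which is why the strictest class is best kept for prompts that force a short answer.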
Custom Eval Creation
Both frameworks support extending beyond built-in tasks. This is critical for domain-specific evaluations — e.g., testing a medical Q&A system or a code generation tool on your proprietary test suite.
Custom Task in lm-eval-harness
Tasks are defined as YAML configurations with optional Python processing functions. Here's a complete custom task definition:
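The Python side of such a task can be sketched as follows. This is a minimal sketch rather than a complete task: the module name utils.py, the dataset field names (question, choices, answer), and the prompt layout are all assumptions, and the task's YAML file would need to reference these functions (in recent harness versions, via entries such as !function utils.doc_to_text).

```python
# utils.py (hypothetical helper module placed next to the task's YAML file)


def doc_to_text(doc):
    """Render one dataset row as the prompt shown to the model."""
    letters = "ABCD"
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, doc["choices"])
    )
    return f"Question: {doc['question']}\n{options}\nAnswer:"


def doc_to_target(doc):
    """Return the index of the correct choice for this row."""
    return doc["answer"]


# Illustrative row; the field names are assumptions about a proprietary dataset.
example_doc = {
    "question": "Which vitamin deficiency causes scurvy?",
    "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "answer": 2,
}
print(doc_to_text(example_doc))
```

Keeping prompt rendering in small pure functions like these makes it possible to unit-test the formatting before spending GPU time on a full run.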
The eval ecosystem extends well beyond the two major players. Each framework fills a specific niche:
Promptfoo
Promptfoo is a developer-friendly CLI and library for testing prompts across multiple providers. It shines at prompt engineering workflows and regression testing in CI.
# promptfooconfig.yaml
prompts:
  - "Summarize this text: {{text}}"
  - "Provide a brief summary of: {{text}}"
providers:
  - openai:gpt-4
  - openai:gpt-3.5-turbo
  - ollama:llama2
tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "The summary should be concise and capture the main action."
      - type: similar
        value: "A fox jumps over a dog."
        threshold: 0.8
# Run from CLI
npx promptfoo eval
npx promptfoo view # Opens web UI with results
DeepEval
DeepEval is a Python testing framework that integrates with pytest and focuses on RAG and LLM application evaluations with built-in metrics:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric


def test_rag_response():
    # One test case bundles the query, the model's answer, and the retrieved context.
    test_case = LLMTestCase(
        input="What is RLHF?",
        actual_output="RLHF stands for Reinforcement Learning from Human Feedback...",
        retrieval_context=["RLHF is a technique for aligning LLMs..."],
        expected_output="RLHF is Reinforcement Learning from Human Feedback.",
    )
    # Each metric passes only if its score meets the threshold.
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])
# Run with pytest
pytest test_evals.py -v
deepeval test run test_evals.py # With dashboard integration
Other Notable Frameworks
HELM (Stanford)
Holistic Evaluation of Language Models. Comprehensive multi-metric framework covering accuracy, calibration, robustness, fairness, efficiency, and bias. Heavy-weight but thorough.
LangSmith Evaluations
Part of the LangChain ecosystem. Deep integration with LangChain traces, custom evaluators, and a hosted dashboard for tracking eval runs over time.
Ragas
Purpose-built for RAG evaluation with metrics like faithfulness, answer relevancy, context precision, and context recall. Lightweight and easy to integrate.
TruLens
Feedback functions for evaluating LLM apps. Focuses on groundedness, relevance, and toxicity. Provides a dashboard for tracking quality over time.
Framework Comparison
Selecting a framework requires weighing several dimensions: benchmark coverage, extensibility, CI integration, cost, and community support.
End-to-End CI Pipeline Example
A practical CI eval pipeline combining lm-eval-harness for baseline benchmarks and Promptfoo for regression tests:
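The regression gate at the end of such a pipeline might be sketched in Python as follows. Everything here is an assumption made for illustration: the function names, the 0.01 tolerance, and the flat {task: accuracy} dictionaries, which a real job would load from the files written by the evaluation step. The scores are invented.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return (task, baseline_score, current_score) for every task that
    dropped by more than `tolerance` or is missing from the current run."""
    regressions = []
    for task, old_score in sorted(baseline.items()):
        new_score = current.get(task)
        if new_score is None or old_score - new_score > tolerance:
            regressions.append((task, old_score, new_score))
    return regressions


def gate(baseline, current, tolerance=0.01):
    """Return a process exit code: 0 if the run passes, 1 on regression."""
    regressions = find_regressions(baseline, current, tolerance)
    for task, old_score, new_score in regressions:
        print(f"REGRESSION {task}: {old_score} -> {new_score}")
    return 1 if regressions else 0


# Made-up scores for illustration only.
baseline_scores = {"mmlu": 0.46, "hellaswag": 0.57}
current_scores = {"mmlu": 0.46, "hellaswag": 0.52}
exit_code = gate(baseline_scores, current_scores)
```

A CI job would finish with sys.exit(exit_code) so that a drop beyond the tolerance fails the build. Treating a missing task as a regression guards against a benchmark silently dropping out of the run.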
The decision tree below helps map your use case to the best-fit framework:
Practical Recommendations
For research teams: Start with lm-eval-harness. It covers every major academic benchmark, is the standard used by the Open LLM Leaderboard, and gives you reproducible comparisons against published results.
For application developers: Use Promptfoo or DeepEval. Both integrate with CI/CD, support custom assertions, and provide web dashboards. DeepEval is the better choice if you're building RAG applications; Promptfoo excels at multi-provider prompt comparison.
Common mistake: Don't rely on a single eval metric. Combine automated metrics (BLEU, exact match) with LLM-as-judge evaluations and human review for a complete picture. Each captures different failure modes.
Building a Multi-Framework Eval Suite
Here's a Python orchestrator that runs multiple frameworks and aggregates results:
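What follows is a minimal sketch under stated assumptions, not a production implementation. The names FrameworkRun, run_framework, and run_suite are hypothetical, the commands in SUITE simply reuse the CLI invocations shown earlier, and a step counts as passed when its process exits with code 0.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class FrameworkRun:
    name: str
    command: list  # argv-style command, e.g. ["npx", "promptfoo", "eval"]


def run_framework(run, timeout=None):
    """Run one framework's CLI and capture its outcome."""
    try:
        completed = subprocess.run(
            run.command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        # Covers a missing executable or a run that exceeded the timeout.
        return {"name": run.name, "passed": False,
                "returncode": None, "output_tail": str(error)}
    return {"name": run.name, "passed": completed.returncode == 0,
            "returncode": completed.returncode,
            "output_tail": completed.stdout[-500:]}


def run_suite(runs, timeout=None):
    """Run every framework in order and aggregate into one report dict."""
    results = [run_framework(run, timeout) for run in runs]
    return {"all_passed": all(r["passed"] for r in results), "results": results}


# Real suite: commands reuse the CLI invocations shown earlier in this article.
SUITE = [
    FrameworkRun("lm-eval-harness", [
        "lm_eval", "--model", "hf",
        "--model_args", "pretrained=meta-llama/Llama-2-7b-hf",
        "--tasks", "mmlu,hellaswag",
        "--output_path", "./results/llama2-7b/",
    ]),
    FrameworkRun("promptfoo", ["npx", "promptfoo", "eval"]),
    FrameworkRun("deepeval", ["deepeval", "test", "run", "test_evals.py"]),
]

# Stand-in commands so the sketch runs anywhere; pass SUITE for real use.
demo = [
    FrameworkRun("passing-step", [sys.executable, "-c", "print('ok')"]),
    FrameworkRun("failing-step", [sys.executable, "-c", "raise SystemExit(1)"]),
]
report = run_suite(demo)
for result in report["results"]:
    print(f"{result['name']}: {'PASS' if result['passed'] else 'FAIL'}")
```

Exit codes are the only signal used here. A fuller version would also parse each framework's output files to pull scores into the report.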
This modular approach lets each framework focus on its strength — academic rigor from lm-eval-harness, prompt-level regression from Promptfoo, and RAG-specific quality from DeepEval — while a single orchestrator produces a unified report for stakeholders.