Agent Evaluation: Task Completion and Safety
LLM-powered agents can browse the web, write code, execute shell commands, and orchestrate multi-step workflows. But how do you know whether an agent actually works? Traditional NLP metrics like BLEU or perplexity are meaningless when the output is a sequence of actions that modify real-world state. Agent evaluation requires a fundamentally different approach — one that measures task completion, trajectory quality, and safety in tandem.
This post covers the landscape of agent evaluation: the metrics that matter, the leading benchmarks, safety red-teaming techniques, and how to build custom evaluation harnesses that give you confidence before deploying agents to production.
The Agent Evaluation Challenge
Evaluating an autonomous agent is harder than evaluating a single model call for several compounding reasons:
- Multi-step trajectories — An agent may take 5–50 actions to complete a task. A single wrong step can cascade into total failure, yet a suboptimal step may still lead to a correct outcome.
- Non-determinism — LLM sampling, environment state, and tool latency all introduce variance. The same agent can pass a task on run 1 and fail on run 2.
- Side effects — Unlike a text generation benchmark, agents write files, send API calls, and modify databases. Evaluation must account for what the agent changed in the environment.
- Safety constraints — A correct outcome achieved through unsafe means (e.g., exfiltrating credentials, overwriting production data) is worse than a failure.
Task Completion Metrics
The simplest question — did the agent finish the job? — turns out to have many nuanced answers. Here are the core metrics used across the agent evaluation literature:
Success Rate (SR)
The fraction of evaluation episodes where the agent achieves the goal condition. This is the most common top-line metric in agent benchmarks. A goal condition can be a unit test passing, a web page reaching a target state, or a file matching a reference output.
# Success Rate — the foundational agent metric def success_rate(results: list[EvalResult]) -> float: passed = sum(1 for r in results if r.goal_achieved) return passed / len(results) # pass@k — probability that at least one of k attempts succeeds def pass_at_k(n: int, c: int, k: int) -> float: """n = total runs, c = correct runs, k = samples drawn""" from math import comb if n - c < k: return 1.0 return 1.0 - comb(n - c, k) / comb(n, k)
Trajectory Efficiency
Two agents might both solve a task, but one does it in 4 steps and $0.02, while the other takes 38 steps and $1.40. Trajectory metrics capture this distinction:
- Step count — Total number of actions (tool calls, API requests) in the trajectory.
- Token cost — Total input + output tokens consumed across all LLM calls.
- Wall-clock time — End-to-end latency, critical for interactive use cases.
- Redundancy ratio — Fraction of steps that were retries or reverted actions, measuring wasted work.
def trajectory_efficiency(trajectory: list[Action]) -> dict: total_steps = len(trajectory) retries = sum(1 for a in trajectory if a.is_retry) total_tokens = sum(a.input_tokens + a.output_tokens for a in trajectory) total_cost = sum(a.cost_usd for a in trajectory) return { "steps": total_steps, "redundancy_ratio": retries / total_steps if total_steps else 0, "total_tokens": total_tokens, "cost_usd": total_cost, }
Partial Credit Scoring
Binary pass/fail is too coarse for complex tasks. Many benchmarks now award partial credit — e.g., 3 out of 5 sub-goals completed, or 80% of test cases passing. This is especially important for coding benchmarks where an agent might fix the core bug but miss an edge case.
pass@1 (single attempt) and
pass@5 (best of five). The gap between them reveals how much variance your agent has — a large gap
suggests the agent can solve the task but does so unreliably.
Benchmarks: SWE-bench, GAIA, WebArena
The agent evaluation ecosystem has matured rapidly. Here are the three most influential benchmarks, each testing a different modality of agent capability:
🛠️ SWE-bench
Domain: Software engineering
Task: Given a GitHub issue description, produce a patch that resolves the issue and passes the repository's test suite.
Size: 2,294 tasks from 12 popular Python repos (Django, Flask, sympy, scikit-learn, etc.)
Metric: % resolved (patch applies cleanly + all tests pass)
Why it matters: Tests real-world code reasoning across large codebases with complex dependencies. SWE-bench Lite (300 tasks) provides a faster evaluation subset.
🌍 GAIA
Domain: General AI assistant tasks
Task: Answer questions that require multi-step reasoning, web browsing, file processing, and tool use.
Size: 466 questions across 3 difficulty levels
Metric: Exact-match accuracy (the answer is a short factual string)
Why it matters: Tests end-to-end agentic capability — the agent must decide which tools to use, not just how to use them. Level 3 tasks require 10+ reasoning steps.
🌐 WebArena
Domain: Web navigation and interaction
Task: Complete realistic tasks on self-hosted web applications (e-commerce, forums, GitLab, maps) using browser actions.
Size: 812 tasks across 5 web environments
Metric: Task success rate (functional correctness of the final page state)
Why it matters: Tests grounded interaction with real UIs. Agents must handle login flows, search, pagination, and dynamic content. VisualWebArena extends this with vision-based tasks.
🔬 Emerging Benchmarks
HumanEval-Agent — Extends HumanEval with multi-file projects and tool use.
AgentBench — 8 environments (OS, DB, web, game) testing diverse agent capabilities.
ToolBench — 16,000+ real-world REST APIs to test tool selection at scale.
τ-bench — Tests agents on complex, multi-turn customer-service tasks with policy compliance constraints.
ML-bench — Machine learning experimentation tasks requiring code generation and execution.
Safety Evaluation
Task completion tells you what an agent can do; safety evaluation tells you what it shouldn't do. As agents gain access to more powerful tools — shell execution, file I/O, network requests — the blast radius of a misaligned action grows dramatically.
Threat Categories
A comprehensive safety evaluation should cover at least these categories:
- Prompt injection — Can a malicious input in the environment (e.g., a hidden instruction in a webpage) hijack the agent's behavior?
- Privilege escalation — Does the agent attempt to access resources beyond its granted permissions?
- Data exfiltration — Does the agent leak sensitive context (API keys, PII) to external endpoints?
- Destructive actions — Does the agent delete files, drop tables, or overwrite critical data without confirmation?
- Deceptive alignment — Does the agent behave differently when it believes it is being evaluated vs. in production?
Red-Teaming Methodology
Red-teaming is the practice of adversarially probing an agent to discover failure modes. A structured red-team evaluation includes:
- Define the threat model — What tools does the agent have? What's the worst-case outcome of a compromised agent?
- Create adversarial scenarios — Craft inputs designed to trigger each threat category (injection payloads, permission boundary tests, exfiltration honeypots).
- Run blind evaluations — The agent processes adversarial inputs without knowledge that it's being tested.
- Score violations — Categorize each failure by severity (informational, low, medium, high, critical).
- Compute safety metrics — Violation rate, attack success rate (ASR), refusal accuracy on harmful prompts.
class SafetyEvaluator: def __init__(self, agent, scenarios: list[AdversarialScenario]): self.agent = agent self.scenarios = scenarios def run_evaluation(self) -> SafetyReport: results = [] for scenario in self.scenarios: # Create sandboxed environment for each test env = SandboxEnvironment(scenario.initial_state) trajectory = self.agent.run(scenario.task, env) violation = self._check_violation(trajectory, env, scenario) results.append(EvalResult( scenario_id = scenario.id, category = scenario.threat_category, violated = violation is not None, severity = violation.severity if violation else None, trajectory = trajectory, )) return SafetyReport(results) def _check_violation(self, trajectory, env, scenario) -> Violation | None: # Check all actions against safety invariants for action in trajectory.actions: if action.tool == "shell" and self._has_dangerous_cmd(action.input): return Violation("destructive_action", severity="critical") if action.tool == "http" and self._leaks_secrets(action.input, env): return Violation("data_exfiltration", severity="critical") if self._is_prompt_injection_success(action, scenario): return Violation("prompt_injection", severity="high") return None
Safety Metrics
Quantifying safety requires dedicated metrics beyond task success:
- Attack Success Rate (ASR) — Fraction of adversarial scenarios where the agent is successfully exploited.
- Violation Rate — Number of safety violations per 1,000 agent runs in normal (non-adversarial) operation.
- Refusal Accuracy — When presented with a harmful request, how often does the agent correctly refuse?
- Containment Score — Did the agent stay within its sandbox? Measured by monitoring file system, network, and process activity.
Evaluation Frameworks
Several open-source frameworks have emerged to standardize agent evaluation. Choosing the right one depends on your agent architecture and the environments you need to test against.
Inspect AI
By: UK AI Safety Institute
Strength: Safety-first evaluation with built-in sandboxing (Docker), support for tool-use agents, and composable evaluation pipelines. First-class support for red-teaming scenarios.
Best for: Safety-critical deployments, government compliance, red-team evaluations.
AgentEval (AutoGen)
By: Microsoft Research
Strength: Automatic generation of evaluation criteria from task descriptions using CriticAgent. Integrates natively with the AutoGen multi-agent framework.
Best for: Multi-agent systems, rapid prototyping of evaluation criteria.
Braintrust
Strength: Production-grade eval platform with tracing, scoring, and regression tracking. Supports custom scorers and LLM-as-judge evaluations.
Best for: Teams that need CI/CD-integrated evaluation with dashboards and alerting.
METR (Model Evaluation & Threat Research)
Strength: Focus on measuring dangerous capabilities — autonomous replication, resource acquisition, deception. Used for frontier model evaluations.
Best for: Frontier model safety assessments, capability elicitation testing.
# Example: Inspect AI evaluation pipeline from inspect_ai import Task, eval from inspect_ai.dataset import json_dataset from inspect_ai.scorer import model_graded_fact from inspect_ai.solver import generate, use_tools from inspect_ai.tool import bash, python def agent_coding_eval(): return Task( dataset = json_dataset("coding_tasks.json"), solver = [ use_tools([bash(timeout=30), python(timeout=30)]), generate(), ], scorer = model_graded_fact(), sandbox = "docker", # Isolated execution max_messages = 25, # Limit agent steps ) # Run the evaluation results = eval(agent_coding_eval(), model="openai/gpt-4o")
Building Custom Agent Evaluations
Benchmarks tell you how your agent compares to others; custom evaluations tell you whether it works for your use case. Here's a practical framework for building domain-specific agent evals.
Step 1: Define Evaluation Dimensions
Start by listing exactly what matters for your deployment. Common dimensions include correctness, efficiency, safety, and user experience. Weight each dimension based on your production requirements.
Step 2: Create a Task Suite
Build a representative set of tasks that cover your agent's intended use cases. Include edge cases, adversarial inputs, and tasks at varying difficulty levels. A good rule of thumb: at least 50 tasks for statistical significance, stratified across difficulty tiers.
# Task suite structure for a code-generation agent class EvalTask: id: str description: str difficulty: str # "easy" | "medium" | "hard" context: dict # Files, environment state validators: list[Validator] # Automated checks max_steps: int # Step budget for the agent timeout_sec: int # Wall-clock timeout # Example validators class UnitTestValidator(Validator): def validate(self, env: Environment) -> Score: result = env.run_command("pytest tests/ -q") passed = result.exit_code == 0 # Extract partial credit from pytest output match = re.search(r"(\d+) passed, (\d+) failed", result.stdout) if match: p, f = int(match.group(1)), int(match.group(2)) return Score(value=p / (p + f), passed=passed) return Score(value=1.0 if passed else 0.0, passed=passed) class SafetyValidator(Validator): def validate(self, env: Environment) -> Score: # Check that no files outside workspace were modified modified = env.get_modified_paths() violations = [p for p in modified if not p.startswith(env.workspace)] return Score(value=1.0 if not violations else 0.0, violations=violations)
Step 3: Implement LLM-as-Judge
For dimensions that are hard to validate programmatically — code quality, explanation clarity, user interaction style — use an LLM judge. The judge sees the agent's trajectory and scores it against a rubric. To reduce bias, use a different model family for the judge than the agent.
def llm_judge_score(trajectory: Trajectory, rubric: str) -> JudgeResult: prompt = f"""You are an expert evaluator. Score the following agent trajectory on a scale of 1-5 according to this rubric: {rubric} Agent trajectory: {trajectory.to_text()} Respond with JSON: {{"score": int, "reasoning": str}}""" response = judge_model.generate(prompt) return JudgeResult(**json.loads(response)) # Rubric example for trajectory quality QUALITY_RUBRIC = """ 1 — Agent loops, makes redundant calls, or never converges 2 — Agent reaches the goal but with significant wasted steps 3 — Agent is mostly efficient with minor detours 4 — Agent takes a clean, logical path with minimal waste 5 — Optimal trajectory — could not be meaningfully improved """
Step 4: Automate with CI/CD
Integrate your evaluation suite into your deployment pipeline. Every agent code change should trigger a regression test. Track metrics over time to catch performance degradation early.
# CI/CD integration — run evals on every PR class AgentEvalPipeline: def __init__(self, task_suite: list[EvalTask], baseline: EvalBaseline): self.task_suite = task_suite self.baseline = baseline def run_regression_check(self, agent) -> PipelineResult: results = [] for task in self.task_suite: env = SandboxEnvironment(task.context) traj = agent.run(task.description, env, max_steps=task.max_steps) scores = {} for v in task.validators: scores[v.name] = v.validate(env) results.append((task, scores, traj)) # Compare against baseline report = self._compare_to_baseline(results) if report.regression_detected: print("❌ Regression detected!") print(report.summary) return PipelineResult(passed=False, report=report) print("✅ All eval checks passed") return PipelineResult(passed=True, report=report)
Step 5: Track and Alert
Store every evaluation run — task ID, trajectory, scores, cost, duration — in a structured format. Build dashboards that surface trends and set up alerts for metric degradation. Key thresholds to monitor:
- Success rate drop > 5% between consecutive evaluations
- Mean cost increase > 20% (often signals the agent is looping)
- Any safety violation in production traffic (zero-tolerance alerting)
- Latency p95 > SLA threshold for interactive agent use cases
Agent evaluation is a rapidly evolving field. The benchmarks of today will be saturated within a year, and new capability frontiers — multi-agent collaboration, long-horizon planning, physical-world interaction — will demand entirely new evaluation paradigms. The organizations that invest in rigorous, custom evaluation infrastructure now will be the ones that deploy agents safely and confidently as the technology matures.