Agent Evaluation: Task Completion and Safety

MLOps Series LLM Agents & Orchestration

LLM-powered agents can browse the web, write code, execute shell commands, and orchestrate multi-step workflows. But how do you know whether an agent actually works? Traditional NLP metrics like BLEU or perplexity are meaningless when the output is a sequence of actions that modify real-world state. Agent evaluation requires a fundamentally different approach — one that measures task completion, trajectory quality, and safety in tandem.

This post covers the landscape of agent evaluation: the metrics that matter, the leading benchmarks, safety red-teaming techniques, and how to build custom evaluation harnesses that give you confidence before deploying agents to production.

The Agent Evaluation Challenge

Evaluating an autonomous agent is harder than evaluating a single model call for several compounding reasons:

Multi-step trajectories — An agent may take 5–50 actions to complete a task. A single wrong step can cascade into total failure, yet a suboptimal step may still lead to a correct outcome.
Non-determinism — LLM sampling, environment state, and tool latency all introduce variance. The same agent can pass a task on run 1 and fail on run 2.
Side effects — Unlike a text generation benchmark, agents write files, send API calls, and modify databases. Evaluation must account for what the agent changed in the environment.
Safety constraints — A correct outcome achieved through unsafe means (e.g., exfiltrating credentials, overwriting production data) is worse than a failure.

Key insight: A "correct" agent that completes 90% of tasks but violates safety constraints 5% of the time is far more dangerous than one that completes 70% of tasks safely. Always evaluate safety independently from task success.

Task Completion Metrics

The simplest question — did the agent finish the job? — turns out to have many nuanced answers. Here are the core metrics used across the agent evaluation literature:

Success Rate (SR)

The fraction of evaluation episodes where the agent achieves the goal condition. This is the most common top-line metric in agent benchmarks. A goal condition can be a unit test passing, a web page reaching a target state, or a file matching a reference output.

# Success Rate — the foundational agent metric
def success_rate(results: list[EvalResult]) -> float:
    passed = sum(1 for r in results if r.goal_achieved)
    return passed / len(results)

# pass@k — probability that at least one of k attempts succeeds
def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total runs, c = correct runs, k = samples drawn"""
    from math import comb
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

Trajectory Efficiency

Two agents might both solve a task, but one does it in 4 steps and $0.02, while the other takes 38 steps and $1.40. Trajectory metrics capture this distinction:

Step count — Total number of actions (tool calls, API requests) in the trajectory.
Token cost — Total input + output tokens consumed across all LLM calls.
Wall-clock time — End-to-end latency, critical for interactive use cases.
Redundancy ratio — Fraction of steps that were retries or reverted actions, measuring wasted work.

def trajectory_efficiency(trajectory: list[Action]) -> dict:
    total_steps   = len(trajectory)
    retries       = sum(1 for a in trajectory if a.is_retry)
    total_tokens  = sum(a.input_tokens + a.output_tokens for a in trajectory)
    total_cost    = sum(a.cost_usd for a in trajectory)
    return {
        "steps":           total_steps,
        "redundancy_ratio": retries / total_steps if total_steps else 0,
        "total_tokens":     total_tokens,
        "cost_usd":         total_cost,
    }

Partial Credit Scoring

Binary pass/fail is too coarse for complex tasks. Many benchmarks now award partial credit — e.g., 3 out of 5 sub-goals completed, or 80% of test cases passing. This is especially important for coding benchmarks where an agent might fix the core bug but miss an edge case.

Tip: When reporting agent performance, always include both pass@1 (single attempt) and pass@5 (best of five). The gap between them reveals how much variance your agent has — a large gap suggests the agent can solve the task but does so unreliably.

Benchmarks: SWE-bench, GAIA, WebArena

The agent evaluation ecosystem has matured rapidly. Here are the three most influential benchmarks, each testing a different modality of agent capability:

🛠️ SWE-bench

Domain: Software engineering

Task: Given a GitHub issue description, produce a patch that resolves the issue and passes the repository's test suite.

Size: 2,294 tasks from 12 popular Python repos (Django, Flask, sympy, scikit-learn, etc.)

Metric: % resolved (patch applies cleanly + all tests pass)

Why it matters: Tests real-world code reasoning across large codebases with complex dependencies. SWE-bench Lite (300 tasks) provides a faster evaluation subset.

🌍 GAIA

Domain: General AI assistant tasks

Task: Answer questions that require multi-step reasoning, web browsing, file processing, and tool use.

Size: 466 questions across 3 difficulty levels

Metric: Exact-match accuracy (the answer is a short factual string)

Why it matters: Tests end-to-end agentic capability — the agent must decide which tools to use, not just how to use them. Level 3 tasks require 10+ reasoning steps.

🌐 WebArena

Domain: Web navigation and interaction

Task: Complete realistic tasks on self-hosted web applications (e-commerce, forums, GitLab, maps) using browser actions.

Size: 812 tasks across 5 web environments

Metric: Task success rate (functional correctness of the final page state)

Why it matters: Tests grounded interaction with real UIs. Agents must handle login flows, search, pagination, and dynamic content. VisualWebArena extends this with vision-based tasks.

🔬 Emerging Benchmarks

HumanEval-Agent — Extends HumanEval with multi-file projects and tool use.

AgentBench — 8 environments (OS, DB, web, game) testing diverse agent capabilities.

ToolBench — 16,000+ real-world REST APIs to test tool selection at scale.

τ-bench — Tests agents on complex, multi-turn customer-service tasks with policy compliance constraints.

ML-bench — Machine learning experimentation tasks requiring code generation and execution.

Benchmark selection heuristic: Match your benchmark to your deployment domain. If your agent writes code, start with SWE-bench. If it browses the web, use WebArena. If it orchestrates tools, use GAIA. Never rely on a single benchmark — agents that ace SWE-bench may fail spectacularly on web navigation tasks.

Safety Evaluation

Task completion tells you what an agent can do; safety evaluation tells you what it shouldn't do. As agents gain access to more powerful tools — shell execution, file I/O, network requests — the blast radius of a misaligned action grows dramatically.

Threat Categories

A comprehensive safety evaluation should cover at least these categories:

Prompt injection — Can a malicious input in the environment (e.g., a hidden instruction in a webpage) hijack the agent's behavior?
Privilege escalation — Does the agent attempt to access resources beyond its granted permissions?
Data exfiltration — Does the agent leak sensitive context (API keys, PII) to external endpoints?
Destructive actions — Does the agent delete files, drop tables, or overwrite critical data without confirmation?
Deceptive alignment — Does the agent behave differently when it believes it is being evaluated vs. in production?

Warning: Prompt injection is the most critical attack vector for agents. An agent that retrieves web pages or reads user-supplied files is exposed to indirect prompt injection — where the attacker's instructions are embedded in the data the agent processes, not in the user's prompt.

Red-Teaming Methodology

Red-teaming is the practice of adversarially probing an agent to discover failure modes. A structured red-team evaluation includes:

Define the threat model — What tools does the agent have? What's the worst-case outcome of a compromised agent?
Create adversarial scenarios — Craft inputs designed to trigger each threat category (injection payloads, permission boundary tests, exfiltration honeypots).
Run blind evaluations — The agent processes adversarial inputs without knowledge that it's being tested.
Score violations — Categorize each failure by severity (informational, low, medium, high, critical).
Compute safety metrics — Violation rate, attack success rate (ASR), refusal accuracy on harmful prompts.

class SafetyEvaluator:
    def __init__(self, agent, scenarios: list[AdversarialScenario]):
        self.agent     = agent
        self.scenarios = scenarios

    def run_evaluation(self) -> SafetyReport:
        results = []
        for scenario in self.scenarios:
            # Create sandboxed environment for each test
            env = SandboxEnvironment(scenario.initial_state)
            trajectory = self.agent.run(scenario.task, env)

            violation = self._check_violation(trajectory, env, scenario)
            results.append(EvalResult(
                scenario_id = scenario.id,
                category    = scenario.threat_category,
                violated    = violation is not None,
                severity    = violation.severity if violation else None,
                trajectory  = trajectory,
            ))
        return SafetyReport(results)

    def _check_violation(self, trajectory, env, scenario) -> Violation | None:
        # Check all actions against safety invariants
        for action in trajectory.actions:
            if action.tool == "shell" and self._has_dangerous_cmd(action.input):
                return Violation("destructive_action", severity="critical")
            if action.tool == "http" and self._leaks_secrets(action.input, env):
                return Violation("data_exfiltration", severity="critical")
            if self._is_prompt_injection_success(action, scenario):
                return Violation("prompt_injection", severity="high")
        return None

Safety Metrics

Quantifying safety requires dedicated metrics beyond task success:

Attack Success Rate (ASR) — Fraction of adversarial scenarios where the agent is successfully exploited.
Violation Rate — Number of safety violations per 1,000 agent runs in normal (non-adversarial) operation.
Refusal Accuracy — When presented with a harmful request, how often does the agent correctly refuse?
Containment Score — Did the agent stay within its sandbox? Measured by monitoring file system, network, and process activity.

Evaluation Frameworks

Several open-source frameworks have emerged to standardize agent evaluation. Choosing the right one depends on your agent architecture and the environments you need to test against.

Inspect AI

By: UK AI Safety Institute

Strength: Safety-first evaluation with built-in sandboxing (Docker), support for tool-use agents, and composable evaluation pipelines. First-class support for red-teaming scenarios.

Best for: Safety-critical deployments, government compliance, red-team evaluations.

AgentEval (AutoGen)

By: Microsoft Research

Strength: Automatic generation of evaluation criteria from task descriptions using CriticAgent. Integrates natively with the AutoGen multi-agent framework.

Best for: Multi-agent systems, rapid prototyping of evaluation criteria.

Braintrust

Strength: Production-grade eval platform with tracing, scoring, and regression tracking. Supports custom scorers and LLM-as-judge evaluations.

Best for: Teams that need CI/CD-integrated evaluation with dashboards and alerting.

METR (Model Evaluation & Threat Research)

Strength: Focus on measuring dangerous capabilities — autonomous replication, resource acquisition, deception. Used for frontier model evaluations.

Best for: Frontier model safety assessments, capability elicitation testing.

# Example: Inspect AI evaluation pipeline
from inspect_ai import Task, eval
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python

def agent_coding_eval():
    return Task(
        dataset   = json_dataset("coding_tasks.json"),
        solver    = [
            use_tools([bash(timeout=30), python(timeout=30)]),
            generate(),
        ],
        scorer    = model_graded_fact(),
        sandbox   = "docker",   # Isolated execution
        max_messages = 25,       # Limit agent steps
    )

# Run the evaluation
results = eval(agent_coding_eval(), model="openai/gpt-4o")

Framework selection tip: If you need safety evaluation, start with Inspect AI — it has the best sandboxing and threat-modeling support. If you need production monitoring with regressions, use Braintrust. For multi-agent orchestration testing, AgentEval integrates cleanly with AutoGen.

Building Custom Agent Evaluations

Benchmarks tell you how your agent compares to others; custom evaluations tell you whether it works for your use case. Here's a practical framework for building domain-specific agent evals.

Step 1: Define Evaluation Dimensions

Start by listing exactly what matters for your deployment. Common dimensions include correctness, efficiency, safety, and user experience. Weight each dimension based on your production requirements.

Step 2: Create a Task Suite

Build a representative set of tasks that cover your agent's intended use cases. Include edge cases, adversarial inputs, and tasks at varying difficulty levels. A good rule of thumb: at least 50 tasks for statistical significance, stratified across difficulty tiers.

# Task suite structure for a code-generation agent
class EvalTask:
    id:          str
    description: str
    difficulty:  str           # "easy" | "medium" | "hard"
    context:     dict          # Files, environment state
    validators:  list[Validator]  # Automated checks
    max_steps:   int           # Step budget for the agent
    timeout_sec: int           # Wall-clock timeout

# Example validators
class UnitTestValidator(Validator):
    def validate(self, env: Environment) -> Score:
        result = env.run_command("pytest tests/ -q")
        passed = result.exit_code == 0
        # Extract partial credit from pytest output
        match = re.search(r"(\d+) passed, (\d+) failed", result.stdout)
        if match:
            p, f = int(match.group(1)), int(match.group(2))
            return Score(value=p / (p + f), passed=passed)
        return Score(value=1.0 if passed else 0.0, passed=passed)

class SafetyValidator(Validator):
    def validate(self, env: Environment) -> Score:
        # Check that no files outside workspace were modified
        modified = env.get_modified_paths()
        violations = [p for p in modified if not p.startswith(env.workspace)]
        return Score(value=1.0 if not violations else 0.0, violations=violations)

Step 3: Implement LLM-as-Judge

For dimensions that are hard to validate programmatically — code quality, explanation clarity, user interaction style — use an LLM judge. The judge sees the agent's trajectory and scores it against a rubric. To reduce bias, use a different model family for the judge than the agent.

def llm_judge_score(trajectory: Trajectory, rubric: str) -> JudgeResult:
    prompt = f"""You are an expert evaluator. Score the following agent
trajectory on a scale of 1-5 according to this rubric:

{rubric}

Agent trajectory:
{trajectory.to_text()}

Respond with JSON: {{"score": int, "reasoning": str}}"""

    response = judge_model.generate(prompt)
    return JudgeResult(**json.loads(response))

# Rubric example for trajectory quality
QUALITY_RUBRIC = """
1 — Agent loops, makes redundant calls, or never converges
2 — Agent reaches the goal but with significant wasted steps
3 — Agent is mostly efficient with minor detours
4 — Agent takes a clean, logical path with minimal waste
5 — Optimal trajectory — could not be meaningfully improved
"""

Step 4: Automate with CI/CD

Integrate your evaluation suite into your deployment pipeline. Every agent code change should trigger a regression test. Track metrics over time to catch performance degradation early.

# CI/CD integration — run evals on every PR
class AgentEvalPipeline:
    def __init__(self, task_suite: list[EvalTask], baseline: EvalBaseline):
        self.task_suite = task_suite
        self.baseline   = baseline

    def run_regression_check(self, agent) -> PipelineResult:
        results = []
        for task in self.task_suite:
            env  = SandboxEnvironment(task.context)
            traj = agent.run(task.description, env, max_steps=task.max_steps)

            scores = {}
            for v in task.validators:
                scores[v.name] = v.validate(env)
            results.append((task, scores, traj))

        # Compare against baseline
        report = self._compare_to_baseline(results)
        if report.regression_detected:
            print("❌ Regression detected!")
            print(report.summary)
            return PipelineResult(passed=False, report=report)
        print("✅ All eval checks passed")
        return PipelineResult(passed=True, report=report)

Common pitfall: Don't evaluate your agent with the same model that powers it. If your agent uses GPT-4o, use Claude or Gemini as the judge. Same-model evaluation inflates scores because the judge shares the agent's biases and blind spots.

Step 5: Track and Alert

Store every evaluation run — task ID, trajectory, scores, cost, duration — in a structured format. Build dashboards that surface trends and set up alerts for metric degradation. Key thresholds to monitor:

Success rate drop > 5% between consecutive evaluations
Mean cost increase > 20% (often signals the agent is looping)
Any safety violation in production traffic (zero-tolerance alerting)
Latency p95 > SLA threshold for interactive agent use cases

Agent evaluation is a rapidly evolving field. The benchmarks of today will be saturated within a year, and new capability frontiers — multi-agent collaboration, long-horizon planning, physical-world interaction — will demand entirely new evaluation paradigms. The organizations that invest in rigorous, custom evaluation infrastructure now will be the ones that deploy agents safely and confidently as the technology matures.