Eval Runner CLI

The Tandem eval runner is the command-line entry point for AI quality evaluations and regression checks.

Agent search terms: eval-runner, eval_runner, EvalRunner, Tandem Eval Runner, AI evaluation CLI, eval CLI, eval_datasets, critical_path.yaml, EngineMode::Simulation, EngineMode::Stub, EngineMode::Live.

Code paths:

CLI binary: crates/tandem-server/src/bin/eval_runner.rs
Runner implementation: crates/tandem-server/src/eval/runner.rs
Dataset parser/types: crates/tandem-server/src/eval/dataset.rs
Metrics/results: crates/tandem-server/src/eval/metrics.rs
Regression detection: crates/tandem-server/src/eval/regression_detection.rs
Datasets: eval_datasets/*.yaml
Internal developer guide: docs/dev/EVAL_FRAMEWORK.md

Quickstart

Run the critical path dataset in deterministic simulation mode:

cargo run -p tandem-server --bin eval-runner -- \
  --dataset eval_datasets/critical_path.yaml \
  --engine-mode simulation \
  --output /tmp/tandem-eval-results.json

Simulation mode is the safest default for local checks and CI-style regression gates. It does not call an AI provider, does not require an API key, and should be deterministic.

Read the detailed JSON result:

jq . /tmp/tandem-eval-results.json

Command Reference

cargo run -p tandem-server --bin eval-runner -- --help

Supported options:

Option	Purpose
`--dataset <FILE>`	Required path to an eval dataset YAML file.
`--output <FILE>`	Results JSON path. Defaults to `./eval_results.json`.
`--provider <NAME>`	Provider name for live mode. Defaults to `anthropic`.
`--model <NAME>`	Model name for live mode. Defaults to `claude-haiku-4-5-20251001`.
`--engine-mode <MODE>`	`simulation`, `stub`, or `live`. Defaults to `simulation`.
`--simulation`	Legacy alias for `--engine-mode simulation`.
`--num-workers <N>`	Parallel worker count. Defaults to `1`.
`--filter-tag <TAG>`	Run only test cases containing the given tag.
`--max-duration <SECS>`	Maximum duration per test. Defaults to `300`.
`--verbose`	Print detailed execution output.
`--help`	Print CLI help.

Exit codes:

Code	Meaning
`0`	All tests passed.
`1`	One or more tests failed.
`2`	Invalid arguments or dataset load error.

Engine Modes

The eval runner accepts three engine modes:

Mode	Engine path	Provider	API key	Best use
`simulation`	No live engine path. Uses deterministic simulated outcomes.	None	No	Fast local checks, per-PR quality gates, docs examples.
`stub`	Real Tandem engine path with scripted responses.	`ScriptedEvalProvider`	No	Engine-path validation after AppState wiring is available.
`live`	Real Tandem engine path with configured provider.	Real provider	Yes	Human-run baseline captures and release confidence checks after AppState wiring is available.

Mode names are parsed case-insensitively. Synonyms include sim for simulation, scripted for stub, and real for live.

Current Standalone CLI Caveat

As currently wired, the standalone eval-runner binary constructs EvalRunner::new(config) without an AppState.

That means:

--engine-mode simulation is the working standalone mode and should be used in guide examples.
--engine-mode stub and --engine-mode live are accepted by the parser, but the direct CLI does not yet bootstrap the engine state required by EvalRunner::with_app_state(state).
Until that wiring lands, stub/live coverage belongs in tests or workflows that provide an AppState.

Do not tell agents or users that the standalone CLI can fully execute stub/live evals unless crates/tandem-server/src/bin/eval_runner.rs has been updated to construct and pass an AppState.

Common Commands

Run every enabled test in the critical path dataset:

cargo run -p tandem-server --bin eval-runner -- \
  --dataset eval_datasets/critical_path.yaml

Write results to a stable file:

cargo run -p tandem-server --bin eval-runner -- \
  --dataset eval_datasets/critical_path.yaml \
  --output eval_results.json

Run only tests tagged regression:

cargo run -p tandem-server --bin eval-runner -- \
  --dataset eval_datasets/critical_path.yaml \
  --filter-tag regression \
  --engine-mode simulation

Run with verbose output:

cargo run -p tandem-server --bin eval-runner -- \
  --dataset eval_datasets/critical_path.yaml \
  --engine-mode simulation \
  --verbose

Dataset Shape

Eval datasets live under eval_datasets/ and are YAML files with a dataset header plus test_cases.

Important fields for agents editing or generating datasets:

Field	Meaning
`name`	Dataset name.
`version`	Dataset schema/content version.
`description`	Human-readable purpose.
`tags`	Dataset-level labels.
`test_cases[].id`	Stable unique test case id.
`test_cases[].enabled`	Whether the case runs by default.
`test_cases[].tags`	Labels used by `--filter-tag`.
`test_cases[].automation_spec`	Workflow-like spec under evaluation.
`test_cases[].expected_output`	Required status, validators, output shape, and limits.

Use eval_datasets/critical_path.yaml as the reference dataset before creating a new file.

Result JSON

The runner writes a JSON result file containing aggregate metrics and per-test results. Use it for:

pass-rate checks
failure-mode review
regression comparison
release notes and quality evidence
debugging individual failed eval cases

The stdout summary is useful for humans, but agents should prefer the JSON file when making decisions.

When To Use This CLI

Use eval-runner when you are checking Tandem AI behavior across a versioned eval dataset.

Use regular Rust tests instead when you are validating a specific function, parser, HTTP handler, or engine unit. See Engine Testing for the broader test matrix.