Eval Runner CLI

The Tandem eval runner is the command-line entry point for AI quality evaluations and regression checks.

Agent search terms: eval-runner, eval_runner, EvalRunner, Tandem Eval Runner, AI evaluation CLI, eval CLI, eval_datasets, critical_path.yaml, EngineMode::Simulation, EngineMode::Stub, EngineMode::Live.

Code paths:

  • CLI binary: crates/tandem-server/src/bin/eval_runner.rs
  • Runner implementation: crates/tandem-server/src/eval/runner.rs
  • Dataset parser/types: crates/tandem-server/src/eval/dataset.rs
  • Metrics/results: crates/tandem-server/src/eval/metrics.rs
  • Regression detection: crates/tandem-server/src/eval/regression_detection.rs
  • Datasets: eval_datasets/*.yaml
  • Internal developer guide: docs/dev/EVAL_FRAMEWORK.md

Quickstart

Run the critical path dataset in deterministic simulation mode:

cargo run -p tandem-server --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--engine-mode simulation \
--output /tmp/tandem-eval-results.json

Simulation mode is the safest default for local checks and CI-style regression gates. It does not call an AI provider, does not require an API key, and should be deterministic.

Read the detailed JSON result:

jq . /tmp/tandem-eval-results.json

Command Reference

cargo run -p tandem-server --bin eval-runner -- --help

Supported options:

Option                   Purpose
--dataset <FILE>         Required path to an eval dataset YAML file.
--output <FILE>          Results JSON path. Defaults to ./eval_results.json.
--provider <NAME>        Provider name for live mode. Defaults to anthropic.
--model <NAME>           Model name for live mode. Defaults to claude-haiku-4-5-20251001.
--engine-mode <MODE>     simulation, stub, or live. Defaults to simulation.
--simulation             Legacy alias for --engine-mode simulation.
--num-workers <N>        Parallel worker count. Defaults to 1.
--filter-tag <TAG>       Run only test cases containing the given tag.
--max-duration <SECS>    Maximum duration per test, in seconds. Defaults to 300.
--verbose                Print detailed execution output.
--help                   Print CLI help.

Exit codes:

Code    Meaning
0       All tests passed.
1       One or more tests failed.
2       Invalid arguments or dataset load error.
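These exit codes can be wired into a CI-style gate. The sketch below is illustrative only: run_evals is a stand-in function for the real cargo invocation so the snippet is self-contained.

```shell
# Sketch of a gate on eval-runner exit codes.
# run_evals stands in for the real command, e.g.:
#   cargo run -p tandem-server --bin eval-runner -- \
#     --dataset eval_datasets/critical_path.yaml --engine-mode simulation
run_evals() {
  return 0  # simulate "all tests passed"
}

run_evals
status=$?
case "$status" in
  0) echo "evals passed" ;;
  1) echo "one or more evals failed" >&2; exit 1 ;;
  2) echo "invalid arguments or dataset load error" >&2; exit 2 ;;
  *) echo "unexpected exit code: $status" >&2; exit "$status" ;;
esac
```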

Engine Modes

The eval runner accepts three engine modes:

simulation
  Engine path: no live engine path; uses deterministic simulated outcomes.
  Provider:    none
  API key:     not required
  Best use:    fast local checks, per-PR quality gates, docs examples.

stub
  Engine path: real Tandem engine path with scripted responses.
  Provider:    ScriptedEvalProvider
  API key:     not required
  Best use:    engine-path validation after AppState wiring is available.

live
  Engine path: real Tandem engine path with the configured provider.
  Provider:    real provider
  API key:     required
  Best use:    human-run baseline captures and release confidence checks after AppState wiring is available.

Mode names are parsed case-insensitively. Synonyms include sim for simulation, scripted for stub, and real for live.

Current Standalone CLI Caveat

As currently wired, the standalone eval-runner binary constructs EvalRunner::new(config) without an AppState.

That means:

  • --engine-mode simulation is the working standalone mode and should be used in guide examples.
  • --engine-mode stub and --engine-mode live are accepted by the parser, but the direct CLI does not yet bootstrap the engine state required by EvalRunner::with_app_state(state).
  • Until that wiring lands, stub/live coverage belongs in tests or workflows that provide an AppState.

Do not tell agents or users that the standalone CLI can fully execute stub/live evals unless crates/tandem-server/src/bin/eval_runner.rs has been updated to construct and pass an AppState.

Common Commands

Run every enabled test in the critical path dataset:

cargo run -p tandem-server --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml

Write results to a stable file:

cargo run -p tandem-server --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--output eval_results.json

Run only tests tagged regression:

cargo run -p tandem-server --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--filter-tag regression \
--engine-mode simulation

Run with verbose output:

cargo run -p tandem-server --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--engine-mode simulation \
--verbose

Dataset Shape

Eval datasets live under eval_datasets/ and are YAML files with a dataset header plus test_cases.

Important fields for agents editing or generating datasets:

Field                            Meaning
name                             Dataset name.
version                          Dataset schema/content version.
description                      Human-readable purpose.
tags                             Dataset-level labels.
test_cases[].id                  Stable, unique test case id.
test_cases[].enabled             Whether the case runs by default.
test_cases[].tags                Labels used by --filter-tag.
test_cases[].automation_spec     Workflow-like spec under evaluation.
test_cases[].expected_output     Required status, validators, output shape, and limits.

Use eval_datasets/critical_path.yaml as the reference dataset before creating a new file.
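For orientation, the fields above combine into a skeleton like the following. This is a hypothetical sketch, not a validated dataset: the nested shapes of automation_spec and expected_output are placeholders, and the real structure is defined in crates/tandem-server/src/eval/dataset.rs.

```yaml
# Hypothetical skeleton only; copy real shapes from
# eval_datasets/critical_path.yaml before committing a new dataset.
name: example_dataset
version: "1.0"
description: Minimal illustrative dataset skeleton.
tags: [example]
test_cases:
  - id: example_case_1
    enabled: true
    tags: [regression]
    automation_spec:
      # workflow-like spec under evaluation; shape defined in eval/dataset.rs
    expected_output:
      # required status, validators, output shape, and limits
```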

Result JSON

The runner writes a JSON result file containing aggregate metrics and per-test results. Use it for:

  • pass-rate checks
  • failure-mode review
  • regression comparison
  • release notes and quality evidence
  • debugging individual failed eval cases

The stdout summary is useful for humans, but agents should prefer the JSON file when making decisions.
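A scripted check might read the file with jq along these lines. The field names used here (.summary.pass_rate, .summary.failed) are assumptions for illustration, not the confirmed schema; verify the real shape against crates/tandem-server/src/eval/metrics.rs. The sketch writes a stand-in results file so it runs without the real runner.

```shell
RESULTS=/tmp/tandem-eval-results.json
# Stand-in results file; the field names below are assumed, not confirmed.
echo '{"summary": {"pass_rate": 0.95, "failed": 1}}' > "$RESULTS"

pass_rate=$(jq -r '.summary.pass_rate' "$RESULTS")
echo "pass rate: $pass_rate"
```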

When To Use This CLI

Use eval-runner when you are checking Tandem AI behavior across a versioned eval dataset.

Use regular Rust tests instead when you are validating a specific function, parser, HTTP handler, or engine unit. See Engine Testing for the broader test matrix.