
Eval Runner CLI

The Tandem eval runner is the command-line entry point for AI quality evaluations and regression checks.

Agent search terms: eval-runner, eval_runner, EvalRunner, Tandem Eval Runner, AI evaluation CLI, eval CLI, eval_datasets, critical_path.yaml, EngineMode::Simulation, EngineMode::Stub, EngineMode::Live.

Code paths:

  • CLI binary: crates/tandem-eval/src/bin/eval_runner.rs
  • Runner implementation: crates/tandem-eval/src/runner.rs
  • Dataset parser/types: crates/tandem-eval/src/dataset.rs
  • Metrics/results: crates/tandem-eval/src/metrics.rs
  • Regression detection: crates/tandem-eval/src/regression_detection.rs
  • Datasets: eval_datasets/*.yaml
  • Internal developer guide: docs/dev/EVAL_FRAMEWORK.md

Quickstart

Run the critical path dataset in deterministic simulation mode:

cargo run -p tandem-eval --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--engine-mode simulation \
--output /tmp/tandem-eval-results.json

Simulation mode is the safest default for local checks and CI-style regression gates. It does not call an AI provider, does not require an API key, and uses deterministic simulated outcomes, so repeated runs over the same dataset should produce the same results.

Read the detailed JSON result:

jq . /tmp/tandem-eval-results.json

Command Reference

cargo run -p tandem-eval --bin eval-runner -- --help

Supported options:

| Option | Purpose |
| --- | --- |
| --dataset <FILE> | Required path to an eval dataset YAML file. |
| --output <FILE> | Results JSON path. Defaults to ./eval_results.json. |
| --provider <NAME> | Provider name for live mode. Defaults to anthropic. |
| --model <NAME> | Model name for live mode. Defaults to claude-haiku-4-5-20251001. |
| --engine-mode <MODE> | simulation, stub, or live. Defaults to simulation. |
| --simulation | Legacy alias for --engine-mode simulation. |
| --num-workers <N> | Parallel worker count. Defaults to 1. |
| --filter-tag <TAG> | Run only test cases containing the given tag. |
| --max-duration <SECS> | Maximum duration per test. Defaults to 300. |
| --verbose | Print detailed execution output. |
| --help | Print CLI help. |
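
When another Rust tool shells out to the runner, the options and defaults in the table above can be captured in one place. The following is a minimal sketch under stated assumptions: `RunnerArgs`, `new`, and `to_args` are hypothetical names and not part of tandem-eval, and the choice to emit `--provider` and `--model` only in live mode is an assumption based on the table describing them as live-mode options.

```rust
/// Hypothetical helper that mirrors the documented CLI options and defaults.
/// It is not part of the tandem-eval crate.
#[derive(Debug, Clone)]
pub struct RunnerArgs {
    pub dataset: String,
    pub output: String,
    pub provider: String,
    pub model: String,
    pub engine_mode: String,
    pub num_workers: u32,
    pub filter_tag: Option<String>,
    pub max_duration_secs: u64,
    pub verbose: bool,
}

impl RunnerArgs {
    /// Fill in the defaults listed in the option table; only the dataset is required.
    pub fn new(dataset: &str) -> Self {
        RunnerArgs {
            dataset: dataset.to_string(),
            output: "./eval_results.json".to_string(),
            provider: "anthropic".to_string(),
            model: "claude-haiku-4-5-20251001".to_string(),
            engine_mode: "simulation".to_string(),
            num_workers: 1,
            filter_tag: None,
            max_duration_secs: 300,
            verbose: false,
        }
    }

    /// Build the argument list that follows `--` in the cargo invocation.
    pub fn to_args(&self) -> Vec<String> {
        let mut args = vec![
            "--dataset".to_string(),
            self.dataset.clone(),
            "--output".to_string(),
            self.output.clone(),
            "--engine-mode".to_string(),
            self.engine_mode.clone(),
            "--num-workers".to_string(),
            self.num_workers.to_string(),
            "--max-duration".to_string(),
            self.max_duration_secs.to_string(),
        ];
        // Assumption: provider and model only matter in live mode, so skip them otherwise.
        if self.engine_mode == "live" {
            args.push("--provider".to_string());
            args.push(self.provider.clone());
            args.push("--model".to_string());
            args.push(self.model.clone());
        }
        if let Some(tag) = &self.filter_tag {
            args.push("--filter-tag".to_string());
            args.push(tag.clone());
        }
        if self.verbose {
            args.push("--verbose".to_string());
        }
        args
    }
}
```

Calling `RunnerArgs::new("eval_datasets/critical_path.yaml").to_args()` yields a simulation-mode argument list with every documented default spelled out, which keeps wrapper scripts explicit about what they run.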

Exit codes:

| Code | Meaning |
| --- | --- |
| 0 | All tests passed. |
| 1 | One or more tests failed. |
| 2 | Invalid arguments or dataset load error. |
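
A wrapper that gates on the runner can translate these exit codes into a typed outcome instead of comparing raw integers. This is a sketch, not the crate's API: `EvalExit`, `from_code`, `gate_passes`, and `run_critical_path` are hypothetical names. Codes other than 0, 1, and 2 are kept as `Unknown` because the table does not define them.

```rust
use std::process::Command;

/// Outcome of an eval-runner invocation, derived from the documented exit codes.
/// `EvalExit` is a hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EvalExit {
    /// Exit code 0: all tests passed.
    AllPassed,
    /// Exit code 1: one or more tests failed.
    TestsFailed,
    /// Exit code 2: invalid arguments or dataset load error.
    UsageOrDatasetError,
    /// Any exit code the documentation does not list.
    Unknown(i32),
}

impl EvalExit {
    pub fn from_code(code: i32) -> Self {
        match code {
            0 => EvalExit::AllPassed,
            1 => EvalExit::TestsFailed,
            2 => EvalExit::UsageOrDatasetError,
            other => EvalExit::Unknown(other),
        }
    }

    /// Only a clean pass should let a quality gate proceed.
    pub fn gate_passes(self) -> bool {
        self == EvalExit::AllPassed
    }
}

/// Run the critical path dataset in simulation mode and classify the result.
pub fn run_critical_path() -> std::io::Result<EvalExit> {
    let status = Command::new("cargo")
        .args([
            "run", "-p", "tandem-eval", "--bin", "eval-runner", "--",
            "--dataset", "eval_datasets/critical_path.yaml",
            "--engine-mode", "simulation",
        ])
        .status()?;
    // `code()` is None when the process was terminated by a signal.
    Ok(status
        .code()
        .map(EvalExit::from_code)
        .unwrap_or(EvalExit::Unknown(-1)))
}
```

Separating `TestsFailed` from `UsageOrDatasetError` lets a caller report a quality regression differently from a misconfigured run.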

Engine Modes

The eval runner accepts three engine modes:

| Mode | Engine path | Provider | API key | Best use |
| --- | --- | --- | --- | --- |
| simulation | No live engine path. Uses deterministic simulated outcomes. | None | No | Fast local checks, per-PR quality gates, docs examples. |
| stub | Real Tandem engine path with scripted responses. | ScriptedEvalProvider | No | Engine-path validation and zero-cost baseline captures. |
| live | Real Tandem engine path with configured provider. | Real provider | Yes | Human-run baseline captures and release confidence checks. |

Mode names are parsed case-insensitively. Synonyms include sim for simulation, scripted for stub, and real for live.
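
The parsing rule described above can be illustrated with a small `FromStr` implementation. This is a sketch and not the code in crates/tandem-eval: the enum here is a local stand-in for `EngineMode`, only the synonyms named in this section are accepted, and `needs_api_key` is a hypothetical helper that restates the API key column of the mode table.

```rust
use std::str::FromStr;

/// Local stand-in for the crate's `EngineMode`; variant names follow the docs.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EngineMode {
    Simulation,
    Stub,
    Live,
}

impl FromStr for EngineMode {
    type Err = String;

    /// Case-insensitive parse that accepts the documented synonyms.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "simulation" | "sim" => Ok(EngineMode::Simulation),
            "stub" | "scripted" => Ok(EngineMode::Stub),
            "live" | "real" => Ok(EngineMode::Live),
            other => Err(format!("unknown engine mode: {}", other)),
        }
    }
}

impl EngineMode {
    /// Per the mode table, only live mode requires an API key.
    pub fn needs_api_key(self) -> bool {
        matches!(self, EngineMode::Live)
    }
}
```

With this sketch, `"SIM".parse::<EngineMode>()` and `"simulation".parse::<EngineMode>()` both resolve to `EngineMode::Simulation`, while an unlisted name returns an error rather than silently falling back to a default.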

Local Engine Bootstrap

When --engine-mode stub or --engine-mode live is used without --engine-token, the CLI bootstraps an isolated in-process AppState. Stub mode swaps in ScriptedEvalProvider; live mode uses configured providers. Passing --engine-token uses the remote engine path instead.
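
The bootstrap decision described above depends on two inputs, the engine mode and whether an engine token was supplied. The sketch below models that decision as a pure function. `EnginePath` and `select_engine_path` are hypothetical names, the enum is a local stand-in for `EngineMode`, and the behavior for simulation mode combined with a token is an assumption, since this page does not specify it.

```rust
/// Local stand-in for the crate's `EngineMode`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EngineMode {
    Simulation,
    Stub,
    Live,
}

/// Hypothetical description of which engine path a run would take.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EnginePath {
    /// Simulation mode: no live engine path.
    Simulated,
    /// Isolated in-process AppState with ScriptedEvalProvider swapped in.
    LocalScripted,
    /// Isolated in-process AppState using configured providers.
    LocalConfigured,
    /// Remote engine reached with the supplied token.
    Remote,
}

pub fn select_engine_path(mode: EngineMode, engine_token: Option<&str>) -> EnginePath {
    match (mode, engine_token) {
        // Assumption: simulation never touches an engine, so a token is ignored.
        (EngineMode::Simulation, _) => EnginePath::Simulated,
        // Stub or live with a token uses the remote engine path.
        (_, Some(_)) => EnginePath::Remote,
        // Without a token, the CLI bootstraps a local in-process engine.
        (EngineMode::Stub, None) => EnginePath::LocalScripted,
        (EngineMode::Live, None) => EnginePath::LocalConfigured,
    }
}
```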

Common Commands

Run every enabled test in the critical path dataset:

cargo run -p tandem-eval --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml

Write results to a stable file:

cargo run -p tandem-eval --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--output eval_results.json

Run only tests tagged regression:

cargo run -p tandem-eval --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--filter-tag regression \
--engine-mode simulation

Run with verbose output:

cargo run -p tandem-eval --bin eval-runner -- \
--dataset eval_datasets/critical_path.yaml \
--engine-mode simulation \
--verbose

Dataset Shape

Eval datasets live under eval_datasets/ and are YAML files with a dataset header plus test_cases.

Important fields for agents editing or generating datasets:

| Field | Meaning |
| --- | --- |
| name | Dataset name. |
| version | Dataset schema/content version. |
| description | Human-readable purpose. |
| tags | Dataset-level labels. |
| test_cases[].id | Stable unique test case id. |
| test_cases[].enabled | Whether the case runs by default. |
| test_cases[].tags | Labels used by --filter-tag. |
| test_cases[].automation_spec | Workflow-like spec under evaluation. |
| test_cases[].expected_output | Required status, validators, output shape, and limits. |

Use eval_datasets/critical_path.yaml as the reference dataset before creating a new file.
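
To make the roles of `id`, `enabled`, and `tags` concrete, the sketch below models a reduced test case and shows how case selection could work. It is not the type in crates/tandem-eval/src/dataset.rs: `TestCase` here omits `automation_spec` and `expected_output` because their shapes are not described on this page, `select_cases` and `duplicate_ids` are hypothetical helpers, and applying the tag filter on top of `enabled` is an assumption.

```rust
use std::collections::HashSet;

/// Reduced, hypothetical view of a dataset test case.
#[derive(Debug, Clone)]
pub struct TestCase {
    pub id: String,
    pub enabled: bool,
    pub tags: Vec<String>,
}

/// Keep enabled cases, optionally restricted to those carrying `filter_tag`.
pub fn select_cases<'a>(cases: &'a [TestCase], filter_tag: Option<&str>) -> Vec<&'a TestCase> {
    cases
        .iter()
        .filter(|c| c.enabled)
        .filter(|c| match filter_tag {
            Some(tag) => c.tags.iter().any(|t| t.as_str() == tag),
            None => true,
        })
        .collect()
}

/// Report ids that appear more than once, since ids are meant to be unique.
pub fn duplicate_ids(cases: &[TestCase]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for case in cases {
        // `insert` returns false when the id was already present.
        if !seen.insert(case.id.as_str()) && !dups.contains(&case.id) {
            dups.push(case.id.clone());
        }
    }
    dups
}
```

A check like `duplicate_ids` is useful for agents that generate datasets, because a repeated id would make per-test results ambiguous when comparing runs.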

Result JSON

The runner writes a JSON result file containing aggregate metrics and per-test results. Use it for:

  • pass-rate checks
  • failure-mode review
  • regression comparison
  • release notes and quality evidence
  • debugging individual failed eval cases

The stdout summary is useful for humans, but agents should prefer the JSON file when making decisions.
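
Once per-test results have been loaded from the JSON file, pass-rate checks and regression comparison reduce to simple set logic. The sketch below assumes the results have already been deserialized. The result schema is not documented on this page, so `CaseResult`, its `id` and `passed` fields, `pass_rate`, and `newly_failing` are all hypothetical and should be adapted to the real output.

```rust
/// Hypothetical per-test result; field names are assumptions, not the real schema.
#[derive(Debug, Clone)]
pub struct CaseResult {
    pub id: String,
    pub passed: bool,
}

/// Fraction of passing cases, or None when there are no results to score.
pub fn pass_rate(results: &[CaseResult]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    let passed = results.iter().filter(|r| r.passed).count();
    Some(passed as f64 / results.len() as f64)
}

/// Ids that passed in the baseline run but fail in the current run.
pub fn newly_failing<'a>(baseline: &'a [CaseResult], current: &'a [CaseResult]) -> Vec<&'a str> {
    current
        .iter()
        .filter(|c| !c.passed)
        .filter(|c| baseline.iter().any(|b| b.id == c.id && b.passed))
        .map(|c| c.id.as_str())
        .collect()
}
```

Listing newly failing ids, rather than only comparing aggregate pass rates, catches the case where one test regresses while another starts passing and the overall rate stays flat.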

When To Use This CLI

Use eval-runner when you are checking Tandem AI behavior across a versioned eval dataset.

Use regular Rust tests instead when you are validating a specific function, parser, HTTP handler, or engine unit. See Engine Testing for the broader test matrix.