A benchmark dataset for terminal-based AI agents in natural sciences, built from real research workflows across life sciences, physical sciences, earth sciences, and mathematical/computational sciences.
## Overview
Terminal-Bench-Science (TB-Science) is a natural sciences extension of the Terminal-Bench series, jointly developed by Stanford University and Laude Institute, led by Steven Dillmann. It transforms real scientific computing workflows from leading research labs into containerized benchmark tasks, evaluating AI agents' ability to execute end-to-end research tasks in terminal environments through deterministic programmatic verification and optional Agent Judge mechanisms.
## Evaluation Design
- Real Workflow Transformation: Scientific computing workflows from leading research labs, containerized into executable benchmark tasks
- Deterministic Programmatic Verification: Reproducible binary (0/1) reward signals via `test.sh` + pytest, written to `/logs/verifier/reward.txt`, with test reports in CTRF format
- Agent Judge + Rubric Scoring: Optional LLM-driven review mechanism that supplements programmatic verification for open-ended tasks
- No Oracle Solution Required: Supports evaluation modes that don't depend on reference solutions
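The binary-reward contract above can be sketched in shell. The function name, the `demo_dir` path, and the stand-in `true` command below are illustrative assumptions, not part of the TB-Science spec; a real task's `test.sh` would run the pytest suite and emit a CTRF report:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the binary-reward contract: run a test command,
# then write 1 on success or 0 on failure to <log_dir>/reward.txt.
# In a real task, the command would be the pytest suite driven by test.sh.
write_reward() {
  local log_dir="$1"; shift
  mkdir -p "$log_dir"
  if "$@"; then
    echo 1 > "$log_dir/reward.txt"
  else
    echo 0 > "$log_dir/reward.txt"
  fi
}

demo_dir="$(mktemp -d)"
write_reward "$demo_dir" true    # a passing "suite" yields reward 1
cat "$demo_dir/reward.txt"
```

Because the reward is a single deterministic bit on disk, re-running the verifier on the same container state always reproduces the same score.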
## Task Characteristics
- Long-Horizon Cascading Errors: Long task chains where early errors cascade and amplify in subsequent steps
- Rich Environments: Contains real research codebases and instrument data, not simplified toy environments
- Expert-Level Scientific Knowledge: Requires domain expertise for correct execution
- Cross-Disciplinary Coverage: Life sciences (biology, medicine, neuroscience), physical sciences (astronomy, chemistry & materials, physics), earth sciences (atmospheric science, geology, oceanography), mathematical & computational sciences (applied math, scientific computing, data science & statistics)
- Aggressive Difficulty: Targets only 10–20% completion rates by frontier models at launch
## Quality Assurance
- Contamination Prevention: Canary strings are automatically added to task files
- Automated PR Review: Auto-runs task overview, static checks, and 29 LLM-based review criteria on submission
- PR Command System: Supports the `/overview`, `/review`, `/validate`, `/run`, and `/cheat` commands
## Architecture & Implementation
The evaluation engine is built on the Harbor framework (Python 90.6%), with TB-Science itself existing as a task dataset and specification. Each task runs in an isolated Docker container with declarative resource requirements via task.toml (CPU, memory, GPU, storage, network access).
### Task Directory Structure
```
tasks/<domain>/<field>/<task-name>/
├── instruction.md       # Agent task instructions
├── task.toml            # Configuration & metadata
├── environment/
│   ├── Dockerfile       # Container environment
│   └── data/            # Optional: data files
├── solution/
│   └── solve.sh         # Reference solution (Oracle)
└── tests/
    └── test.sh          # Test script
```
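As a sketch, this layout can be scaffolded by hand; in practice `harbor tasks init` generates it, and the domain/field/task names below are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical manual scaffold of the TB-Science task layout
# (normally created by `harbor tasks init`); path parts are placeholders.
cd "$(mktemp -d)"
task=tasks/physical-sciences/astronomy/demo-task
mkdir -p "$task"/environment/data "$task"/solution "$task"/tests
touch "$task"/instruction.md "$task"/task.toml \
      "$task"/environment/Dockerfile \
      "$task"/solution/solve.sh "$task"/tests/test.sh
find "$task" -type f | sort
```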
### task.toml Core Configuration
| Section | Description |
|---|---|
| `[metadata]` | Author info, difficulty description, domain/field/subfield tags, expert time estimate |
| `[verifier]` | Verification timeout (default 120 s) |
| `[agent]` | Agent runtime timeout (default 1800 s; difficult tasks can run for hours) |
| `[environment]` | Docker build timeout; CPU/memory/storage/GPU limits; network access permissions |
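A hypothetical `task.toml` illustrating the four sections; every key name and value here is an assumption for illustration, not the verified schema:

```toml
# Illustrative only — key names are assumptions, not the verified schema.
[metadata]
author = "researcher@example.org"
difficulty = "hard"
domain = "physical-sciences"
field = "astronomy"
expert_time_estimate_minutes = 90

[verifier]
timeout_sec = 120.0          # default verification timeout

[agent]
timeout_sec = 1800.0         # difficult tasks may be allowed hours

[environment]
build_timeout_sec = 600.0
cpus = 4
memory_gb = 16
storage_gb = 20
gpus = 0
network = false
```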
## Task Creation & Admission Flow
```
harbor tasks init  →  edit files  →  harbor check (LLM quality review)
  →  harbor run -a oracle (solvability verification)
  →  harbor run -a nop (empty-agent baseline)
  →  PR submission  →  automated review (29 criteria) + human review  →  merge
```
## Experimental Features
- GPU Resource Containers: Supports ML training/inference and simulation tasks (specific GPU types TBD)
- Multi-Container Tasks: Cross-container collaboration evaluation scenarios (support level TBD)
## Installation & Quick Start
Prerequisites: Docker (`docker ps` must work)

```shell
# Install the Harbor evaluation framework
uv tool install harbor

# Configure API keys
export ANTHROPIC_API_KEY=<your_anthropic_key>
export OPENAI_API_KEY=<your_openai_key>
export GEMINI_API_KEY=<your_gemini_key>

# Verify task solvability
harbor run -p tasks/<task-domain>/<task-field>/<task-name> -a oracle

# Run an AI agent evaluation
harbor run -p tasks/<task-domain>/<task-field>/<task-name> -a <agent> -m <provider/model>
```
## Quality Checks
```shell
# LLM-driven quality check
harbor check -r rubrics/task-implementation.toml -m anthropic/claude-opus-4-6 tasks/<domain>/<field>/<task-name>

# Local static checks
for check in ci_checks/check-*.sh; do bash "$check" tasks/<domain>/<field>/<task-name>; done
```
## Current Progress & Roadmap
- 2 tasks merged (1 in Neuroscience, 1 in Chemistry & Materials); target: 100+
- Started Q1 2026, planned public release and Leaderboard launch in Q3 2026
- Paper planned for Q3 2026 submission to top natural science journals or top ML conferences (not yet published)
- Terminal-Bench series already appears in model cards of Claude Opus 4.6, GPT-5.3-Codex, Gemini 3.1 Pro
- Language breakdown: Python (56.6%), Shell (25.1%), Rust (10.1%), Julia (5.4%), Dockerfile (2.8%)
## Unconfirmed Information
- Specific paper submission venue not yet determined
- Merged task names and descriptions require checking the `tasks/` directory
- Available GPU types and quota mechanisms for GPU tasks not specified
- Multi-container task orchestration details and network topology not specified
- Complete Agent support list for Harbor framework requires further documentation review