A benchmark dataset for terminal-based AI agents in natural sciences, built from real research workflows across life sciences, physical sciences, earth sciences, and mathematical/computational sciences.
## Overview
Terminal-Bench-Science (TB-Science) is a natural sciences extension of the Terminal-Bench series, jointly developed by Stanford University and Laude Institute, led by Steven Dillmann. It transforms real scientific computing workflows from leading research labs into containerized benchmark tasks, evaluating AI agents' ability to execute end-to-end research tasks in terminal environments through deterministic programmatic verification and optional Agent Judge mechanisms.
## Evaluation Design
- Real Workflow Transformation: Scientific computing workflows from leading research labs, containerized into executable benchmark tasks
- Deterministic Programmatic Verification: Reproducible binary (0/1) reward signals via `test.sh` + pytest, written to `/logs/verifier/reward.txt`, with test reports in CTRF format
- Agent Judge + Rubric Scoring: Optional LLM-driven review mechanism that supplements programmatic verification for open-ended tasks
- No Oracle Solution Required: Supports evaluation modes that don't depend on reference solutions
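The binary-reward contract above can be sketched in shell. The function name, the `demo_dir` path, and the stand-in `true` command below are illustrative assumptions, not part of the TB-Science spec; a real task's `test.sh` would run the pytest suite and emit a CTRF report:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the binary-reward contract: run a test command,
# then write 1 on success or 0 on failure to <log_dir>/reward.txt.
# In a real task, the command would be the pytest suite driven by test.sh.
write_reward() {
  local log_dir="$1"; shift
  mkdir -p "$log_dir"
  if "$@"; then
    echo 1 > "$log_dir/reward.txt"
  else
    echo 0 > "$log_dir/reward.txt"
  fi
}

demo_dir="$(mktemp -d)"
write_reward "$demo_dir" true    # a passing "suite" yields reward 1
cat "$demo_dir/reward.txt"
```

Because the reward is a single deterministic bit on disk, re-running the verifier on the same container state always reproduces the same score.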
## Task Characteristics
- Long-Horizon Cascading Errors: Long task chains where early errors cascade and amplify in subsequent steps
- Rich Environments: Contains real research codebases and instrument data, not simplified toy environments
- Expert-Level Scientific Knowledge: Requires domain expertise for correct execution
- Cross-Disciplinary Coverage: Life sciences (biology, medicine, neuroscience), physical sciences (astronomy, chemistry & materials, physics), earth sciences (atmospheric science, geology, oceanography), mathematical & computational sciences (applied math, scientific computing, data science & statistics)
- Aggressive Difficulty: Targets only 10–20% completion rates by frontier models at launch
## Quality Assurance
- Contamination Prevention: Canary strings are automatically added to task files
- Automated PR Review: Auto-runs task overview, static checks, and 29 LLM-based review criteria on submission
- PR Command System: Supports the `/overview`, `/review`, `/validate`, `/run`, and `/cheat` commands
## Architecture & Implementation
The evaluation engine is built on the Harbor framework (Python 90.6%), with TB-Science itself existing as a task dataset and specification. Each task runs in an isolated Docker container with declarative resource requirements via task.toml (CPU, memory, GPU, storage, network access).
### Task Directory Structure
```
tasks/<domain>/<field>/<task-name>/
├── instruction.md       # Agent task instructions
├── task.toml            # Configuration & metadata
├── environment/
│   ├── Dockerfile       # Container environment
│   └── data/            # Optional: data files
├── solution/
│   └── solve.sh         # Reference solution (Oracle)
└── tests/
    └── test.sh          # Test script
```
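As a sketch, this layout can be scaffolded by hand; in practice `harbor tasks init` generates it, and the domain/field/task names below are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical manual scaffold of the TB-Science task layout
# (normally created by `harbor tasks init`); path parts are placeholders.
cd "$(mktemp -d)"
task=tasks/physical-sciences/astronomy/demo-task
mkdir -p "$task"/environment/data "$task"/solution "$task"/tests
touch "$task"/instruction.md "$task"/task.toml \
      "$task"/environment/Dockerfile \
      "$task"/solution/solve.sh "$task"/tests/test.sh
find "$task" -type f | sort
```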
### task.toml Core Configuration
| Section | Description |
|---|---|
| `[metadata]` | Author info, difficulty description, domain/field/subfield tags, expert time estimate |
| `[verifier]` | Verification timeout (default 120 s) |
| `[agent]` | Agent runtime timeout (default 1800 s; difficult tasks can run for hours) |
| `[environment]` | Docker build timeout; CPU/memory/storage/GPU limits; network access permissions |
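A hypothetical `task.toml` illustrating the four sections; every key name and value here is an assumption for illustration, not the verified schema:

```toml
# Illustrative only — key names are assumptions, not the verified schema.
[metadata]
author = "researcher@example.org"
difficulty = "hard"
domain = "physical-sciences"
field = "astronomy"
expert_time_estimate_minutes = 90

[verifier]
timeout_sec = 120.0          # default verification timeout

[agent]
timeout_sec = 1800.0         # difficult tasks may be allowed hours

[environment]
build_timeout_sec = 600.0
cpus = 4
memory_gb = 16
storage_gb = 20
gpus = 0
network = false
```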
## Task Creation & Admission Flow
```
harbor tasks init  →  edit files  →  harbor check (LLM quality review)
  →  harbor run -a oracle (solvability verification)
  →  harbor run -a nop (empty-agent baseline)
  →  PR submission  →  automated review (29 criteria) + human review  →  merge
```
## Experimental Features
- GPU Resource Containers: Supports ML training/inference and simulation tasks (specific GPU types TBD)
- Multi-Container Tasks: Cross-container collaboration evaluation scenarios (support level TBD)
## Installation & Quick Start
Prerequisites: Docker (`docker ps` must work)

```shell
# Install the Harbor evaluation framework
uv tool install harbor

# Configure API keys
export ANTHROPIC_API_KEY=<your_anthropic_key>
export OPENAI_API_KEY=<your_openai_key>
export GEMINI_API_KEY=<your_gemini_key>

# Verify task solvability
harbor run -p tasks/<task-domain>/<task-field>/<task-name> -a oracle

# Run an AI agent evaluation
harbor run -p tasks/<task-domain>/<task-field>/<task-name> -a <agent> -m <provider/model>
```
## Quality Checks
```shell
# LLM-driven quality check
harbor check -r rubrics/task-implementation.toml -m anthropic/claude-opus-4-6 tasks/<domain>/<field>/<task-name>

# Local static checks
for check in ci_checks/check-*.sh; do bash "$check" tasks/<domain>/<field>/<task-name>; done
```
## Current Progress & Roadmap
- 2 tasks merged (1 in Neuroscience, 1 in Chemistry & Materials); target: 100+
- Started Q1 2026, planned public release and Leaderboard launch in Q3 2026
- Paper planned for Q3 2026 submission to top natural science journals or top ML conferences (not yet published)
- Terminal-Bench series already appears in model cards of Claude Opus 4.6, GPT-5.3-Codex, Gemini 3.1 Pro
- Language breakdown: Python (56.6%), Shell (25.1%), Rust (10.1%), Julia (5.4%), Dockerfile (2.8%)
## Unconfirmed Information
- Specific paper submission venue not yet determined
- Merged task names and descriptions require checking the `tasks/` directory
- Available GPU types and quota mechanisms for GPU tasks not specified
- Multi-container task orchestration details and network topology not specified
- Complete Agent support list for Harbor framework requires further documentation review