A framework for evaluating and optimizing agents and models in container environments, supporting standardized benchmarking, large-scale parallel execution, and RL training data generation.
## Core Positioning
Harbor addresses the challenge of evaluating CLI-based AI agents (such as Claude Code, OpenHands, and Codex CLI) and their underlying language models safely, reproducibly, and at scale, through standardized benchmarking and reinforcement learning optimization. Harbor's scope is "environment orchestration + agent scheduling + result collection": it does not implement LLM inference or agent logic itself. Agent behavior is driven by external CLI tools, and model switching is delegated to litellm via the `--model` parameter.
## Capability Matrix
| Theme | Feature | Description |
|---|---|---|
| Agent Evaluation | Standardized CLI agent benchmarking | Pre-integrated with Claude Code, OpenHands, Codex CLI, etc., specified via `--agent` |
| Benchmark Registry | Built-in dataset browsing & versioning | `harbor datasets list` shows SWE-Bench, Aider Polyglot, Terminal-Bench-2.0, etc. |
| Sandbox Isolation | Containerized task execution | All tasks run in isolated containers for safety and reproducibility |
| Large-Scale Parallelism | Multi-cloud sandbox backend scheduling | Supports Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE; up to thousands of concurrent environments |
| RL Optimization | Rollout data generation & framework integration | Generates RL training traces; integrates with SkyRL and GEPA |
| Custom Extensions | User-built benchmarks & environments | Build and share custom evaluation benchmarks and containerized task environments |
| CI/CD Integration | Agent testing in continuous integration | Embeddable in CI pipelines |
| Reward Modeling | RewardKit sub-package | `packages/rewardkit` provides standalone reward modeling capabilities |
| Visualization | Frontend viewer | `apps/viewer` provides web-based result viewing |
| Skill System | Reusable skill definitions | `skills/` + `skills-lock.json` manage skill declarations and version locking |
## Use Cases
- Horizontal comparison of agent/model capabilities on standard benchmarks (Terminal-Bench-2.0, SWE-Bench, Aider Polyglot, etc.)
- Enterprise-specific task sets for agent regression testing
- Batch generation of agent execution traces for RL training data
- Automated agent capability testing embedded in CI/CD workflows
- Large-scale experiments with hundreds to thousands of concurrent environments via cloud sandboxes
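The RL-trace use case above can be sketched minimally: agent rollouts collected as JSON Lines, one record per task execution. The record schema here (`task_id`, `messages`, `reward`) and the `write_rollouts` helper are illustrative assumptions, not Harbor's actual rollout format.

```python
import json
from pathlib import Path

def write_rollouts(records: list[dict], path: Path) -> int:
    """Write agent execution traces as JSON Lines, one rollout per line.

    The record schema (task_id / messages / reward) is a hypothetical
    example, not Harbor's own export format.
    """
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

rollouts = [
    {"task_id": "terminal-bench/build-fix", "messages": [{"role": "user", "content": "fix the build"}], "reward": 1.0},
    {"task_id": "terminal-bench/log-grep", "messages": [{"role": "user", "content": "find the error"}], "reward": 0.0},
]
n = write_rollouts(rollouts, Path("rollouts.jsonl"))
print(n)  # 2
```

JSON Lines is a common interchange format for RL rollout data because each trace stays independently parseable and files can be streamed or sharded.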
## Architecture
Repository structure (monorepo style):

- `src/harbor/` - Core framework source; CLI entry at `harbor.cli.main:app` (Typer-based)
- `packages/rewardkit/` - Reward model sub-package, an independent Python package
- `adapters/` - Agent adapter layer, unifying different CLI agents into the framework
- `apps/viewer/` - Frontend visualization app (TypeScript)
- `skills/` + `skills-lock.json` - Skill definitions and lock file
- `rfcs/` - Design documents
- `docs/`, `examples/` - Documentation and examples
Core dependencies: Typer (CLI), Rich (terminal rendering), Pydantic (data validation), litellm (multi-LLM calls), HuggingFace `datasets` (data fetching), Jinja2 (template rendering), FastAPI + Uvicorn (server), Supabase (backend storage). The build system uses `uv_build` with the `uv` package manager.
Execution Flow: User specifies dataset, agent, model, and concurrency via CLI → Framework fetches task definitions from registry → Schedules sandbox environments per concurrency strategy (local Docker or cloud backends) → Launches agent in isolated container to execute tasks → Collects execution traces and results → Optionally feeds into RewardKit for scoring or exports as RL rollout data.
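The flow above can be sketched as a generic scheduling loop. Everything in this sketch (the `Task` type, `fetch_tasks`, `run_in_sandbox`, the bounded thread pool) is hypothetical scaffolding that shows the shape of fetch → schedule → execute → collect; it is not Harbor's internal implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instruction: str

def fetch_tasks(dataset: str) -> list[Task]:
    # Stand-in for pulling task definitions from the registry.
    return [Task(f"{dataset}/task-{i}", f"instruction {i}") for i in range(8)]

def run_in_sandbox(task: Task, agent: str, model: str) -> dict:
    # Stand-in for launching the agent inside an isolated container
    # and collecting its trace and result.
    return {"task_id": task.task_id, "agent": agent, "model": model, "passed": True}

def run(dataset: str, agent: str, model: str, n_concurrent: int) -> list[dict]:
    tasks = fetch_tasks(dataset)
    # A bounded worker pool mirrors the --n-concurrent concurrency strategy.
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        return list(pool.map(lambda t: run_in_sandbox(t, agent, model), tasks))

results = run("terminal-bench@2.0", "claude-code", "anthropic/claude-opus-4-1", n_concurrent=4)
print(len(results))  # 8
```

The key design point is that the scheduler only bounds concurrency and collects results; the agent logic itself lives behind the sandbox boundary.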
## Installation & Quick Start
Requirements: Python ≥ 3.12
```shell
# Install (either)
uv tool install harbor
pip install harbor
```

Cloud backend optional dependencies (install as needed): `harbor[e2b]`, `harbor[daytona]`, `harbor[modal]`, `harbor[runloop]`, `harbor[gke]`, `harbor[tensorlake]`, `harbor[islo]`. `harbor[cloud]` installs all cloud backends; `harbor[tinker]` adds Tinker ecosystem integration.
Local Docker — Terminal-Bench-2.0:
```shell
export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 4
```
Cloud Parallel (Daytona, 100 concurrent):
```shell
export ANTHROPIC_API_KEY=<YOUR-KEY>
export DAYTONA_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 100 \
  --env daytona
```
## CLI Core
- Entry commands: `harbor` / `hr` / `hb` (all equivalent)
  - `harbor run --dataset <dataset@version> --agent <agent> --model <model>`: run an evaluation
  - `harbor datasets list`: list available benchmark datasets
- Key parameters: `--dataset` (`dataset@version`), `--agent` (agent under test), `--model` (litellm format), `--n-concurrent` (concurrency), `--env` (sandbox backend)
## Ecosystem
- Upstream: Created by the Terminal-Bench authors; Terminal-Bench-2.0 is the default built-in benchmark
- Agent integrations: Claude Code, OpenHands, Codex CLI
- Cloud sandboxes: Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE
- RL frameworks: SkyRL, GEPA
- Datasets: SWE-Bench, Aider Polyglot, Terminal-Bench-2.0 (browsable via registry)
## Notes
- No formal paper has been identified for Harbor itself; the exact relationship between the Terminal-Bench papers and Harbor remains to be confirmed.
- ISLO and Tinker integration specifics are not detailed in public materials.
- The online registry (https://registry.harborframework.com/) is being upgraded; its current functionality is to be confirmed.
- FastAPI endpoint documentation and RewardKit detailed API are pending further documentation.
- Current version: 0.5.0, Apache-2.0 license.