A framework for evaluating and optimizing agents and models in container environments, supporting standardized benchmarking, large-scale parallel execution, and RL training data generation.
## Core Positioning
Harbor addresses the challenge of evaluating CLI-based AI agents (such as Claude Code, OpenHands, and Codex CLI) and their underlying language models safely, reproducibly, and at scale, through standardized benchmarking and reinforcement learning optimization. Harbor's scope is "environment orchestration + agent scheduling + result collection": it does not implement LLM inference or agent logic itself. Agent behavior is driven by external CLI tools, and model switching is delegated to litellm via the `--model` parameter.
## Capability Matrix
| Theme | Feature | Description |
|---|---|---|
| Agent Evaluation | Standardized CLI agent benchmarking | Pre-integrated with Claude Code, OpenHands, Codex CLI, etc., specified via `--agent` |
| Benchmark Registry | Built-in dataset browsing & versioning | `harbor datasets list` shows SWE-Bench, Aider Polyglot, Terminal-Bench-2.0, etc. |
| Sandbox Isolation | Containerized task execution | All tasks run in isolated containers for safety and reproducibility |
| Large-Scale Parallelism | Multi-cloud sandbox backend scheduling | Supports Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE; up to thousands of concurrent environments |
| RL Optimization | Rollout data generation & framework integration | Generates RL training traces; integrates with SkyRL and GEPA |
| Custom Extensions | User-built benchmarks & environments | Build and share custom evaluation benchmarks and containerized task environments |
| CI/CD Integration | Agent testing in continuous integration | Embeddable in CI pipelines |
| Reward Modeling | RewardKit sub-package | `packages/rewardkit` provides standalone reward modeling capabilities |
| Visualization | Frontend viewer | `apps/viewer` provides web-based result viewing |
| Skill System | Reusable skill definitions | `skills/` + `skills-lock.json` manage skill declarations and version locking |
## Use Cases
- Horizontal comparison of agent/model capabilities on standard benchmarks (Terminal-Bench-2.0, SWE-Bench, Aider Polyglot, etc.)
- Enterprise-specific task sets for agent regression testing
- Batch generation of agent execution traces for RL training data
- Automated agent capability testing embedded in CI/CD workflows
- Large-scale experiments with hundreds to thousands of concurrent environments via cloud sandboxes
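The RL-trace use case above can be sketched minimally: agent rollouts collected as JSON Lines, one record per task execution. The record schema here (`task_id`, `messages`, `reward`) and the `write_rollouts` helper are illustrative assumptions, not Harbor's actual rollout format.

```python
import json
from pathlib import Path

def write_rollouts(records: list[dict], path: Path) -> int:
    """Write agent execution traces as JSON Lines, one rollout per line.

    The record schema (task_id / messages / reward) is a hypothetical
    example, not Harbor's own export format.
    """
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

rollouts = [
    {"task_id": "terminal-bench/build-fix", "messages": [{"role": "user", "content": "fix the build"}], "reward": 1.0},
    {"task_id": "terminal-bench/log-grep", "messages": [{"role": "user", "content": "find the error"}], "reward": 0.0},
]
n = write_rollouts(rollouts, Path("rollouts.jsonl"))
print(n)  # 2
```

JSON Lines is a common interchange format for RL rollout data because each trace stays independently parseable and files can be streamed or sharded.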
## Architecture
Repository structure (monorepo style):

- `src/harbor/` - Core framework source; CLI entry at `harbor.cli.main:app` (Typer-based)
- `packages/rewardkit/` - Reward model sub-package, an independent Python package
- `adapters/` - Agent adapter layer, unifying different CLI agents into the framework
- `apps/viewer/` - Frontend visualization app (TypeScript)
- `skills/` + `skills-lock.json` - Skill definitions and lock file
- `rfcs/` - Design documents
- `docs/`, `examples/` - Documentation and examples
Core dependencies: Typer (CLI), Rich (terminal rendering), Pydantic (data validation), litellm (multi-LLM calls), HuggingFace `datasets` (data fetching), Jinja2 (template rendering), FastAPI + Uvicorn (server), Supabase (backend storage). The build system uses `uv_build` with the `uv` package manager.
Execution Flow: User specifies dataset, agent, model, and concurrency via CLI → Framework fetches task definitions from registry → Schedules sandbox environments per concurrency strategy (local Docker or cloud backends) → Launches agent in isolated container to execute tasks → Collects execution traces and results → Optionally feeds into RewardKit for scoring or exports as RL rollout data.
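The flow above can be sketched as a generic scheduling loop. Everything in this sketch (the `Task` type, `fetch_tasks`, `run_in_sandbox`, the bounded thread pool) is hypothetical scaffolding that shows the shape of fetch → schedule → execute → collect; it is not Harbor's internal implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    instruction: str

def fetch_tasks(dataset: str) -> list[Task]:
    # Stand-in for pulling task definitions from the registry.
    return [Task(f"{dataset}/task-{i}", f"instruction {i}") for i in range(8)]

def run_in_sandbox(task: Task, agent: str, model: str) -> dict:
    # Stand-in for launching the agent inside an isolated container
    # and collecting its trace and result.
    return {"task_id": task.task_id, "agent": agent, "model": model, "passed": True}

def run(dataset: str, agent: str, model: str, n_concurrent: int) -> list[dict]:
    tasks = fetch_tasks(dataset)
    # A bounded worker pool mirrors the --n-concurrent concurrency strategy.
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        return list(pool.map(lambda t: run_in_sandbox(t, agent, model), tasks))

results = run("terminal-bench@2.0", "claude-code", "anthropic/claude-opus-4-1", n_concurrent=4)
print(len(results))  # 8
```

The key design point is that the scheduler only bounds concurrency and collects results; the agent logic itself lives behind the sandbox boundary.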
## Installation & Quick Start
Requirements: Python ≥ 3.12
```shell
# Install (either)
uv tool install harbor
pip install harbor
```

Cloud backend optional dependencies (install as needed): `harbor[e2b]`, `harbor[daytona]`, `harbor[modal]`, `harbor[runloop]`, `harbor[gke]`, `harbor[tensorlake]`, `harbor[islo]`. `harbor[cloud]` installs all cloud backends; `harbor[tinker]` adds Tinker ecosystem integration.
Local Docker — Terminal-Bench-2.0:
```shell
export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 4
```
Cloud Parallel (Daytona, 100 concurrent):
```shell
export ANTHROPIC_API_KEY=<YOUR-KEY>
export DAYTONA_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 100 \
  --env daytona
```
## CLI Core
- Entry commands: `harbor` / `hr` / `hb` (all equivalent)
  - `harbor run --dataset <dataset@version> --agent <agent> --model <model>`: run an evaluation
  - `harbor datasets list`: list available benchmark datasets
- Key parameters: `--dataset` (`dataset@version`), `--agent` (agent under test), `--model` (litellm format), `--n-concurrent` (concurrency), `--env` (sandbox backend)
## Ecosystem
- Upstream: Created by the Terminal-Bench authors; Terminal-Bench-2.0 is the default built-in benchmark
- Agent integrations: Claude Code, OpenHands, Codex CLI
- Cloud sandboxes: Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE
- RL frameworks: SkyRL, GEPA
- Datasets: SWE-Bench, Aider Polyglot, Terminal-Bench-2.0 (browsable via registry)
## Notes
- No formal paper has been identified for Harbor itself; the exact relationship between the Terminal-Bench papers and Harbor remains to be confirmed.
- ISLO and Tinker integration specifics are not detailed in public materials.
- The online registry (https://registry.harborframework.com/) is being upgraded; its current functionality is to be confirmed.
- FastAPI endpoint documentation and RewardKit detailed API are pending further documentation.
- Current version: 0.5.0, Apache-2.0 license.