
Harbor

Added: Apr 24, 2026
Category: Agent & Tooling
Open Source
Tags: Python, Workflow Automation, Docker, PyTorch, Large Language Models, AI Agents, Reinforcement Learning, CLI, Agent & Tooling, Model & Inference Framework, Education & Research Resources, Model Training & Inference

A framework for evaluating and optimizing agents and models in container environments, supporting standardized benchmarking, large-scale parallel execution, and RL training data generation.

Core Positioning

Harbor addresses the challenge of evaluating CLI-based AI agents (such as Claude Code, OpenHands, and Codex CLI) and their underlying language models safely, reproducibly, and at scale, through standardized benchmarking and reinforcement learning optimization. Harbor focuses on environment orchestration, agent scheduling, and result collection; it does not include LLM inference or agent logic itself. Agent behavior is driven by external CLI tools, and model switching is delegated to litellm via the --model parameter.
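For orientation, litellm model identifiers generally follow a `<provider>/<model>` convention, as in anthropic/claude-opus-4-1. A minimal stdlib-only sketch of that split (illustrative only; this is neither Harbor nor litellm code):

```python
# Illustrative sketch: litellm-style model strings take the form
# "<provider>/<model-name>", e.g. "anthropic/claude-opus-4-1".
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a litellm-style model identifier into (provider, model)."""
    provider, sep, name = model_id.partition("/")
    if not sep:
        # No provider prefix given; litellm infers the provider in that case.
        return "", provider
    return provider, name

provider, name = split_model_id("anthropic/claude-opus-4-1")
```

Whatever string is passed to --model is forwarded unchanged, so any model litellm can route to is usable without changes to Harbor itself.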

Capability Matrix

| Theme | Feature | Description |
| --- | --- | --- |
| Agent Evaluation | Standardized CLI agent benchmarking | Pre-integrated with Claude Code, OpenHands, Codex CLI, etc., specified via --agent |
| Benchmark Registry | Built-in dataset browsing & versioning | harbor datasets list shows SWE-Bench, Aider Polyglot, Terminal-Bench-2.0, etc. |
| Sandbox Isolation | Containerized task execution | All tasks run in isolated containers for safety and reproducibility |
| Large-Scale Parallelism | Multi-cloud sandbox backend scheduling | Supports Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE; up to thousands of concurrent environments |
| RL Optimization | Rollout data generation & framework integration | Generates RL training traces; integrated with SkyRL and GEPA |
| Custom Extensions | User-built benchmarks & environments | Supports building and sharing custom evaluation benchmarks and containerized task environments |
| CI/CD Integration | Agent testing in continuous integration | Embeddable in CI pipelines |
| Reward Modeling | RewardKit sub-package | packages/rewardkit provides standalone reward modeling capabilities |
| Visualization | Frontend viewer | apps/viewer provides web-based result viewing |
| Skill System | Reusable skill definitions | skills/ + skills-lock.json manage skill declarations and version locking |

Use Cases

  • Horizontal comparison of agent/model capabilities on standard benchmarks (Terminal-Bench-2.0, SWE-Bench, Aider Polyglot, etc.)
  • Enterprise-specific task sets for agent regression testing
  • Batch generation of agent execution traces for RL training data
  • Automated agent capability testing embedded in CI/CD workflows
  • Large-scale experiments with hundreds to thousands of concurrent environments via cloud sandboxes
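The trace-export use case above amounts to converting per-step agent logs into self-contained rollout records for RL training. Below is a minimal stdlib-only sketch of that shape; the field names and the example task id are hypothetical, not Harbor's actual schema:

```python
import json

# Hypothetical trace format: one dict per agent step, plus a task-level reward.
def trace_to_rollout(task_id: str, steps: list[dict], reward: float) -> str:
    """Flatten an agent execution trace into a JSONL-style rollout record."""
    record = {
        "task_id": task_id,
        "turns": [
            {"prompt": s["prompt"], "completion": s["completion"]}
            for s in steps
        ],
        "reward": reward,  # e.g. 1.0 if the task's verification tests passed
    }
    return json.dumps(record)

# Hypothetical example task id and trace content.
line = trace_to_rollout(
    "swe-bench/astropy-12907",
    [{"prompt": "fix the failing test", "completion": "patch applied"}],
    reward=1.0,
)
```

Records of this kind, one per line, are the typical input format for RL frameworks such as the SkyRL integration mentioned above.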

Architecture

Repository Structure (monorepo style):

  • src/harbor/ — Core framework source, CLI entry at harbor.cli.main:app (Typer-based)
  • packages/rewardkit/ — Reward model sub-package, independent Python package
  • adapters/ — Agent adapter implementation layer, unifying different CLI agents into the framework
  • apps/viewer/ — Frontend visualization app (TypeScript)
  • skills/ + skills-lock.json — Skill definitions and lock files
  • rfcs/ — Design documents
  • docs/, examples/ — Documentation and examples

Core Dependencies: Typer (CLI), Rich (terminal rendering), Pydantic (data validation), litellm (unified multi-provider LLM calls), HuggingFace datasets (data fetching), Jinja2 (template rendering), FastAPI + Uvicorn (server), Supabase (backend storage). The build system uses uv_build with the uv package manager.

Execution Flow: User specifies dataset, agent, model, and concurrency via CLI → Framework fetches task definitions from registry → Schedules sandbox environments per concurrency strategy (local Docker or cloud backends) → Launches agent in isolated container to execute tasks → Collects execution traces and results → Optionally feeds into RewardKit for scoring or exports as RL rollout data.
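The flow above can be pictured as a bounded-concurrency scheduling loop: all tasks are submitted, but only the configured number run at once. This is an illustrative stdlib-only sketch of the pattern, not Harbor's implementation; run_task and run_benchmark are hypothetical stand-ins:

```python
import asyncio

async def run_task(task_id: str) -> dict:
    """Stand-in for launching one sandboxed agent run and collecting its result."""
    await asyncio.sleep(0)  # placeholder for container startup + agent execution
    return {"task": task_id, "status": "done"}

async def run_benchmark(task_ids: list[str], n_concurrent: int) -> list[dict]:
    """Run all tasks, never exceeding n_concurrent in flight (cf. --n-concurrent)."""
    sem = asyncio.Semaphore(n_concurrent)

    async def bounded(tid: str) -> dict:
        async with sem:  # acquire a slot before starting the sandbox
            return await run_task(tid)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

results = asyncio.run(run_benchmark([f"task-{i}" for i in range(8)], n_concurrent=4))
```

The same pattern scales from a local Docker daemon (small n_concurrent) to cloud sandbox backends, where the concurrency limit can be raised into the hundreds or thousands.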

Installation & Quick Start

Requirements: Python ≥ 3.12

# Install (either)
uv tool install harbor
pip install harbor

# Optional extras for cloud backends (install as needed):
pip install "harbor[e2b]"      # likewise: daytona, modal, runloop, gke, tensorlake, islo
pip install "harbor[cloud]"    # all cloud backends
pip install "harbor[tinker]"   # Tinker ecosystem integration

Local Docker — Terminal-Bench-2.0:

export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
   --agent claude-code \
   --model anthropic/claude-opus-4-1 \
   --n-concurrent 4

Cloud Parallel (Daytona, 100 concurrent):

export ANTHROPIC_API_KEY=<YOUR-KEY>
export DAYTONA_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
   --agent claude-code \
   --model anthropic/claude-opus-4-1 \
   --n-concurrent 100 \
   --env daytona

CLI Core

  • Entry commands: harbor / hr / hb (all equivalent)
  • harbor run --dataset <dataset@version> --agent <agent> --model <model> — Run evaluation
  • harbor datasets list — List available benchmark datasets
  • Key parameters: --dataset (dataset@version), --agent (agent under test), --model (litellm format), --n-concurrent (concurrency), --env (sandbox backend)

Ecosystem

  • Upstream: Created by the Terminal-Bench authors; Terminal-Bench-2.0 is the default built-in benchmark
  • Agent integrations: Claude Code, OpenHands, Codex CLI
  • Cloud sandboxes: Daytona, Modal, E2B, Runloop, ISLO, TensorLake, GKE
  • RL frameworks: SkyRL, GEPA
  • Datasets: SWE-Bench, Aider Polyglot, Terminal-Bench-2.0 (browsable via registry)

Notes

  • No formal paper link found for Harbor itself; the relationship between Terminal-Bench papers and Harbor is to be confirmed.
  • ISLO and Tinker integration specifics are not detailed in public materials.
  • The online registry (https://registry.harborframework.com/) is being upgraded; its current functionality is to be confirmed.
  • FastAPI endpoint documentation and RewardKit detailed API are pending further documentation.
  • Current version: 0.5.0, Apache-2.0 license.
