
SanityHarness

Added: Feb 24, 2026
Category: Agent & Tooling
License: Open Source
Tags: Workflow Automation; Docker; Large Language Models; Go; AI Agents; CLI; Agent & Tooling; Model & Inference Framework; Developer Tools & Coding; Automation, Workflow & RPA

A lightweight evaluation harness for coding agents featuring Docker isolation and weighted scoring, supporting 6 programming languages and 19 major coding agents.

SanityHarness is a CLI tool, written in Go, that standardizes the evaluation of LLM coding agents.

Core Capabilities

  • Docker container isolation for secure task execution
  • 26 coding tasks across 6 languages: Go, Rust, TypeScript, Kotlin, Dart, Zig
  • Built-in integration for 19 coding agents (Claude Code, Gemini, Codex CLI, Cline, Copilot, Kimi, Qwen, Goose, Junie, Kilocode, Amp, Crush, Pi, etc.)
  • Difficulty-based weighted scoring system for fair comparison
  • BLAKE3 hash integrity verification to prevent result tampering
  • Bubblewrap sandbox isolation to limit agent system access
  • Parallel evaluation (--parallel), Watch mode, resumable runs
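
As an illustration of the difficulty-based weighted scoring listed above, here is a minimal Go sketch. It is not SanityHarness's actual formula: the TaskResult type, the WeightedScore function, and the weights 1, 2, and 3 are all assumptions made for this example. It only shows the general idea that passing a harder task contributes more to the final score than passing an easy one.

```go
package main

import "fmt"

// TaskResult is a hypothetical record of one task outcome.
type TaskResult struct {
	Name   string
	Weight float64 // assumed difficulty weight; harder tasks carry more
	Passed bool
}

// WeightedScore returns the earned weight divided by the total weight,
// as a fraction in [0, 1]. It returns 0 when there is no weight at all.
func WeightedScore(results []TaskResult) float64 {
	var earned, total float64
	for _, r := range results {
		total += r.Weight
		if r.Passed {
			earned += r.Weight
		}
	}
	if total == 0 {
		return 0
	}
	return earned / total
}

func main() {
	// Illustrative data only; task names and weights are made up.
	results := []TaskResult{
		{Name: "easy-task", Weight: 1, Passed: true},
		{Name: "medium-task", Weight: 2, Passed: true},
		{Name: "hard-task", Weight: 3, Passed: false},
	}
	fmt.Printf("weighted score: %.2f\n", WeightedScore(results))
}
```

With these made-up weights, an agent that passes two of three tasks scores 0.50 rather than 0.67, because the failed task is the hardest one.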

Use Cases

  • Coding agent development teams: regression testing and capability assessment
  • Researchers: comparing different LLMs on code generation tasks
  • Enterprises: benchmarking before selecting a coding assistance tool

Requirements

  • Go 1.25+
  • Docker (running daemon)
  • bubblewrap (optional, for agent sandbox isolation)

Quick Start

git clone https://github.com/lemon07r/sanityharness.git
cd sanityharness
make tools && make build
./sanity list
./sanity eval --agent gemini --tier all --parallel 4

Core Commands

  • ./sanity list [--language <lang>] [--tier <tier>] - List tasks
  • ./sanity run <task> [--watch] - Run a single task
  • ./sanity eval --agent <name> [--model <model>] [--parallel N] - Evaluate an agent
  • ./sanity show <session-path> - View results
  • ./sanity verify <path> - Verify submission integrity
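
The --parallel N option on eval implies that at most N tasks run at the same time. One common way to get that behavior in Go is a counting semaphore built from a buffered channel, sketched below. This is a generic pattern, not code taken from SanityHarness; RunParallel and the placeholder runner are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// RunParallel applies run to every task with at most limit goroutines
// active at once, and returns the results in input order.
func RunParallel(tasks []string, limit int, run func(string) bool) []bool {
	if limit < 1 {
		limit = 1
	}
	results := make([]bool, len(tasks))
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are already in flight
		go func(i int, t string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends
			results[i] = run(t)      // each goroutine writes its own index
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	tasks := []string{"task-a", "task-b", "task-c", "task-d", "task-e"}
	// Placeholder runner: a real one would execute the task in a container.
	passed := RunParallel(tasks, 4, func(name string) bool { return len(name) > 0 })
	fmt.Println(passed)
}
```

Writing each result to its own slice index keeps the output order stable regardless of which task finishes first, which matters when results feed a report.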

Architecture

  • CLI Layer: Built on Cobra
  • Task System: Task files embedded at compile time for zero-dependency distribution
  • Runtime: Containers stay running, reused via docker exec to reduce overhead
  • Config: Supports ./sanity.toml, ~/.sanity.toml, ~/.config/sanity/config.toml
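
The runtime bullet above describes keeping a container alive and sending each command to it with docker exec, which avoids paying container startup cost per command. The sketch below shows roughly how such a command could be assembled in Go. It is an assumption-laden illustration: ExecInContainer, the container name sanity-go, and the /workspace path are hypothetical, and the harness's real invocation may differ. The command is only built and printed, never run.

```go
package main

import (
	"fmt"
	"os/exec"
)

// ExecInContainer builds, but does not start, a `docker exec` command
// that would run argv inside an already running container.
func ExecInContainer(container, workdir string, argv ...string) *exec.Cmd {
	args := []string{"exec"}
	if workdir != "" {
		// --workdir sets the working directory inside the container.
		args = append(args, "--workdir", workdir)
	}
	args = append(args, container)
	args = append(args, argv...)
	return exec.Command("docker", args...)
}

func main() {
	// Hypothetical container name and path, for illustration only.
	cmd := ExecInContainer("sanity-go", "/workspace", "go", "test", "./...")
	fmt.Println(cmd.Args)
	// A caller would then use cmd.CombinedOutput() or cmd.Run() to execute it.
}
```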

Output Structure

  • summary.json - Complete results with weighted scores
  • attestation.json - BLAKE3 hash verification
  • report.md - Human-readable report
  • submission.json - Leaderboard format submission file
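
To make the role of attestation.json more concrete, the sketch below shows the general shape of hash-based integrity checking: record a digest per file, then recompute and compare later. Two caveats apply. First, SanityHarness uses BLAKE3, which is not in the Go standard library, so this example swaps in SHA-256 to stay self-contained. Second, the real layout of attestation.json is not described here, so the map-based record and the Digest and Verify functions are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Digest returns the hex-encoded SHA-256 of data.
// SHA-256 is a stand-in here; the tool itself uses BLAKE3.
func Digest(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// Verify recomputes the digest of each recorded file and returns the
// sorted names of files that are missing or whose digest has changed.
func Verify(recorded map[string]string, files map[string][]byte) []string {
	var bad []string
	for name, want := range recorded {
		data, ok := files[name]
		if !ok || Digest(data) != want {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	// In-memory stand-ins for output files; contents are made up.
	files := map[string][]byte{
		"summary.json": []byte(`{"score": 0.5}`),
		"report.md":    []byte("# Report\n"),
	}
	recorded := map[string]string{}
	for name, data := range files {
		recorded[name] = Digest(data)
	}

	// Simulate tampering with the results after the digests were recorded.
	files["summary.json"] = []byte(`{"score": 1.0}`)
	fmt.Println("mismatched files:", Verify(recorded, files))
}
```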
