"Unit tests" for AI agent skills — an automated evaluation framework supporting multi-agent cross-validation and CI integration.
## Overview
Skillgrade is an automated evaluation framework designed for AI agent skills, built around the concept of "unit tests for skills." It uses SKILL.md as the skill description contract and eval.yaml to define task instructions, workspace mappings, and grading strategies. The framework launches agents in Docker containers or local sandboxes, then scores results via deterministic scripts or LLM rubrics with weighted combination.
## Core Capabilities

### Evaluation Core
- End-to-end agent skill testing: the full loop from agent launch and skill discovery through task execution to result scoring.
- Dual grading modes:
  - Deterministic Grader: executes commands, parses stdout JSON, returns a 0.0–1.0 score.
  - LLM Rubric Grader: scores session transcripts against qualitative rubrics (Gemini or Anthropic models).
- Weighted combination: final reward = Σ(grader_score × weight) / Σ weight.
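As a sketch, the weighted combination reduces to the following (the `GraderResult` shape and function name are illustrative, not Skillgrade's actual API):

```typescript
interface GraderResult {
  score: number; // 0.0–1.0
  weight: number;
}

// Final reward = Σ(grader_score × weight) / Σ weight
function combineScores(results: GraderResult[]): number {
  const totalWeight = results.reduce((s, r) => s + r.weight, 0);
  if (totalWeight === 0) return 0;
  return results.reduce((s, r) => s + r.score * r.weight, 0) / totalWeight;
}

// E.g. a deterministic grader at 1.0 with weight 0.7 and an LLM rubric
// grader at 0.5 with weight 0.3: (1.0 * 0.7 + 0.5 * 0.3) / 1.0 = 0.85
```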
### Multi-Agent Support
- Gemini (Google Gemini CLI)
- Claude (Anthropic Claude Code)
- Codex (OpenAI Codex CLI)
- ACP (Agent Client Protocol compatible agents, JSON-RPC 2.0 over stdio)
### AI-Assisted Setup

`skillgrade init` leverages LLMs to auto-generate tasks and graders in `eval.yaml`; without an API key, it generates annotated templates instead.
### Run Modes & CI

- Presets: `--smoke` (5 trials), `--reliable` (15 trials), `--regression` (30 trials, high-confidence regression detection).
- `--ci` mode exits with a non-zero code when the pass rate falls below the threshold (default 0.8); `--provider=local` enables Docker-free CI runs; `--parallel=N` controls trial concurrency.
### Security & Isolation

- Per-task `workspace` file mappings for isolated execution in Docker containers or local sandboxes.
- `.env` file auto-loading with shell override support; all loaded values are auto-sanitized in persisted logs.
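The sanitization of loaded `.env` values might look roughly like this (a sketch of the described behavior, not the actual implementation):

```typescript
// Replace every occurrence of each loaded secret value with a
// placeholder before the log is persisted.
function sanitizeLog(log: string, env: Record<string, string>): string {
  let out = log;
  for (const [name, value] of Object.entries(env)) {
    if (value) out = out.split(value).join(`[${name}]`);
  }
  return out;
}
```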
### Result Visualization

- CLI report: `skillgrade preview`
- Web UI: `skillgrade preview browser` (default: `http://localhost:3847`)
## Architecture & Implementation

### Directory Structure
```
src/        — Core source code
bin/        — CLI entry point
examples/   — Examples (superlint, angular-modern TypeScript grader)
fixtures/   — Test fixtures
graders/    — Grader scripts
skills/     — Built-in skills
templates/  — Templates
tests/      — Tests
```
### Execution Model

1. Read the `eval.yaml` configuration.
2. Prepare workspace file mappings for each task.
3. Launch the target agent in a Docker container (default) or a local sandbox.
4. The agent executes the task described in the instruction.
5. Run the grader chain: deterministic graders first, then LLM rubric graders.
6. Calculate weighted scores and aggregate results across all trials.
7. Output a report or set the CI exit code.
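The aggregation and CI-gating steps can be sketched as follows (that a trial "passes" when its reward meets the threshold is an assumption; function names are illustrative):

```typescript
// Fraction of trials whose reward meets the pass threshold.
function passRate(trialRewards: number[], threshold: number): number {
  if (trialRewards.length === 0) return 0;
  const passed = trialRewards.filter((r) => r >= threshold).length;
  return passed / trialRewards.length;
}

// CI mode: non-zero exit code when the pass rate falls below the
// threshold (default 0.8, matching --threshold).
function ciExitCode(trialRewards: number[], threshold = 0.8): number {
  return passRate(trialRewards, threshold) >= threshold ? 0 : 1;
}
```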
### ACP Communication
Launches ACP-compatible agents as child processes using JSON-RPC 2.0 over stdio. Agents handle their own authentication — no direct API key management needed.
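ACP-style stdio transports typically exchange newline-delimited JSON-RPC 2.0 messages. A rough sketch of the framing (the method name and message layout here are illustrative; the actual ACP wire format may differ):

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

// One JSON message per line, written to the child agent's stdin.
function encodeRequest(id: number, method: string, params?: unknown): string {
  const msg: JsonRpcRequest = { jsonrpc: "2.0", id, method, params };
  return JSON.stringify(msg) + "\n";
}

// Responses arrive as lines on the child's stdout.
function decodeResponse(line: string): { jsonrpc: string; id: number; result?: unknown; error?: unknown } {
  return JSON.parse(line.trim());
}
```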
### Tech Stack

TypeScript project using vitest as the test framework. Language breakdown: TypeScript 86.5%, HTML 10.1%, JavaScript 3.4%. Requires Node.js 20+ and Docker.
## Installation & Usage
```bash
# Install globally
npm i -g skillgrade

# Generate eval.yaml in a skill directory (AI-assisted)
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init

# Run a quick 5-trial evaluation
GEMINI_API_KEY=your-key skillgrade --smoke

# View results in the CLI or the web UI
skillgrade preview
skillgrade preview browser
```
## `eval.yaml` Example
```yaml
version: "1"

defaults:
  agent: gemini
  provider: docker
  trials: 5
  timeout: 300
  threshold: 0.8
  grader_model: gemini-3-flash-preview

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.
    workspace:
      - src: fixtures/broken-app.js
        dest: app.js
    graders:
      - type: deterministic
        run: bash graders/check.sh
        weight: 0.7
      - type: llm_rubric
        rubric: "Did the agent follow the check → fix → verify workflow?"
        weight: 0.3
```
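The `bash graders/check.sh` entry above can be any executable that prints a JSON score on stdout. A minimal deterministic grader could look like this in TypeScript (the check itself and the exact output schema are assumptions based on the "parses stdout JSON, returns 0.0–1.0" description):

```typescript
import { readFileSync } from "node:fs";

// Hypothetical check: score 1.0 if the fixed file no longer uses the
// legacy `var` keyword, 0.0 otherwise.
function gradeSource(source: string): number {
  return /\bvar\b/.test(source) ? 0.0 : 1.0;
}

// Skillgrade is described as parsing JSON from the grader's stdout.
if (process.argv[2]) {
  const score = gradeSource(readFileSync(process.argv[2], "utf8"));
  console.log(JSON.stringify({ score }));
}
```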
## Key CLI Options
| Flag | Description |
|---|---|
| `--smoke` / `--reliable` / `--regression` | Preset trial counts (5/15/30) |
| `--agent=gemini\|claude\|codex\|acp` | Specify agent |
| `--provider=docker\|local` | Execution environment |
| `--ci` | CI mode, non-zero exit on failure |
| `--threshold=0.8` | CI pass-rate threshold |
| `--eval=NAME[,NAME]` | Run specific evaluations by name |
| `--grader=TYPE` | Run only the specified grader type |
| `--parallel=N` | Concurrent trial count |
| `--validate` | Validate graders against a reference solution |
## Use Cases
- Skill QA: Write repeatable, quantifiable evaluation tests for custom AI agent skills.
- Regression detection: Auto-run evaluations after skill updates to detect functional degradation.
- Multi-agent cross-validation: Test the same skill across Gemini, Claude, Codex, etc.
- CI/CD pipeline integration: Gate skill quality in continuous integration workflows.
- Skill development assistance: `skillgrade init` auto-generates evaluation configs, lowering the barrier to entry.
## Ecosystem
- Related project: skills-best-practices — Skill authoring best practices guide by the same author.
- Inspired by: SkillsBench and the paper "Demystifying Evals for AI Agents".
## Unconfirmed Information
- npm publish status: the README mentions `npm i -g skillgrade`, but GitHub Packages shows "No packages published"; the package may be on the public npmjs.com registry.
- Version/Release: no tags or releases in the repo; only a `main` branch.
- Blog post: the link points to a March 2026 date; its existence and accessibility are not verified.
- The `grader_model` default `gemini-3-flash-preview` may be a preview/experimental model; current availability unconfirmed.
- The specific ACP protocol version and the list of compatible agents are not specified.
- Whether the LLM Rubric Grader supports OpenAI models as graders is not documented.