"Unit tests" for AI agent skills — an automated evaluation framework supporting multi-agent cross-validation and CI integration.
## Overview
Skillgrade is an automated evaluation framework designed for AI agent skills, built around the concept of "unit tests for skills." It uses SKILL.md as the skill description contract and eval.yaml to define task instructions, workspace mappings, and grading strategies. The framework launches agents in Docker containers or local sandboxes, then scores results via deterministic scripts or LLM rubrics with weighted combination.
## Core Capabilities

### Evaluation Core
- End-to-end agent skill testing: the full loop from agent launch and skill discovery through task execution to result scoring.
- Dual grading modes:
  - Deterministic Grader: executes commands, parses stdout JSON, returns a 0.0–1.0 score.
  - LLM Rubric Grader: scores session transcripts against qualitative rubrics (Gemini or Anthropic models).
- Weighted combination: final reward = Σ(grader_score × weight) / Σ weight.
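As a sketch, the weighted combination reduces to the following (the `GraderResult` shape and function name are illustrative, not Skillgrade's actual API):

```typescript
interface GraderResult {
  score: number; // 0.0–1.0
  weight: number;
}

// Final reward = Σ(grader_score × weight) / Σ weight
function combineScores(results: GraderResult[]): number {
  const totalWeight = results.reduce((s, r) => s + r.weight, 0);
  if (totalWeight === 0) return 0;
  return results.reduce((s, r) => s + r.score * r.weight, 0) / totalWeight;
}

// E.g. a deterministic grader at 1.0 with weight 0.7 and an LLM rubric
// grader at 0.5 with weight 0.3: (1.0 * 0.7 + 0.5 * 0.3) / 1.0 = 0.85
```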
### Multi-Agent Support
- Gemini (Google Gemini CLI)
- Claude (Anthropic Claude Code)
- Codex (OpenAI Codex CLI)
- ACP (Agent Client Protocol compatible agents, JSON-RPC 2.0 over stdio)
### AI-Assisted Setup

`skillgrade init` leverages LLMs to auto-generate tasks and graders in `eval.yaml`; without an API key, it generates annotated templates instead.
### Run Modes & CI

- Presets: `--smoke` (5 trials), `--reliable` (15 trials), `--regression` (30 trials, high-confidence regression detection).
- `--ci` mode exits with a non-zero code when the pass rate falls below the threshold (default 0.8); `--provider=local` enables Docker-free CI runs; `--parallel=N` controls trial concurrency.
### Security & Isolation

- Per-task `workspace` file mappings for isolated execution in Docker containers or local sandboxes.
- `.env` file auto-loading with shell override support; all loaded values are auto-sanitized in persisted logs.
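The sanitization of loaded `.env` values might look roughly like this (a sketch of the described behavior, not the actual implementation):

```typescript
// Replace every occurrence of each loaded secret value with a
// placeholder before the log is persisted.
function sanitizeLog(log: string, env: Record<string, string>): string {
  let out = log;
  for (const [name, value] of Object.entries(env)) {
    if (value) out = out.split(value).join(`[${name}]`);
  }
  return out;
}
```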
### Result Visualization

- CLI report: `skillgrade preview`
- Web UI: `skillgrade preview browser` (default: `http://localhost:3847`)
## Architecture & Implementation

### Directory Structure
```
src/        — Core source code
bin/        — CLI entry point
examples/   — Examples (superlint, angular-modern TypeScript grader)
fixtures/   — Test fixtures
graders/    — Grader scripts
skills/     — Built-in skills
templates/  — Templates
tests/      — Tests
```
### Execution Model

1. Read the `eval.yaml` configuration.
2. Prepare workspace file mappings for each task.
3. Launch the target agent in a Docker container (default) or a local sandbox.
4. The agent executes the task described in the instruction.
5. Run the grader chain: deterministic graders first, then LLM rubric graders.
6. Calculate weighted scores and aggregate results across all trials.
7. Output a report or set the CI exit code.
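The aggregation and CI-gating steps can be sketched as follows (that a trial "passes" when its reward meets the threshold is an assumption; function names are illustrative):

```typescript
// Fraction of trials whose reward meets the pass threshold.
function passRate(trialRewards: number[], threshold: number): number {
  if (trialRewards.length === 0) return 0;
  const passed = trialRewards.filter((r) => r >= threshold).length;
  return passed / trialRewards.length;
}

// CI mode: non-zero exit code when the pass rate falls below the
// threshold (default 0.8, matching --threshold).
function ciExitCode(trialRewards: number[], threshold = 0.8): number {
  return passRate(trialRewards, threshold) >= threshold ? 0 : 1;
}
```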
### ACP Communication
Launches ACP-compatible agents as child processes using JSON-RPC 2.0 over stdio. Agents handle their own authentication — no direct API key management needed.
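ACP-style stdio transports typically exchange newline-delimited JSON-RPC 2.0 messages. A rough sketch of the framing (the method name and message layout here are illustrative; the actual ACP wire format may differ):

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

// One JSON message per line, written to the child agent's stdin.
function encodeRequest(id: number, method: string, params?: unknown): string {
  const msg: JsonRpcRequest = { jsonrpc: "2.0", id, method, params };
  return JSON.stringify(msg) + "\n";
}

// Responses arrive as lines on the child's stdout.
function decodeResponse(line: string): { jsonrpc: string; id: number; result?: unknown; error?: unknown } {
  return JSON.parse(line.trim());
}
```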
### Tech Stack

TypeScript project using vitest as the test framework. Language breakdown: TypeScript 86.5%, HTML 10.1%, JavaScript 3.4%. Requires Node.js 20+ and Docker.
## Installation & Usage
```bash
# Install globally
npm i -g skillgrade

# Generate eval.yaml in a skill directory (AI-assisted)
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init

# Run a quick 5-trial evaluation
GEMINI_API_KEY=your-key skillgrade --smoke

# View results in the CLI or the web UI
skillgrade preview
skillgrade preview browser
```
## `eval.yaml` Example
```yaml
version: "1"

defaults:
  agent: gemini
  provider: docker
  trials: 5
  timeout: 300
  threshold: 0.8
  grader_model: gemini-3-flash-preview

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.
    workspace:
      - src: fixtures/broken-app.js
        dest: app.js
    graders:
      - type: deterministic
        run: bash graders/check.sh
        weight: 0.7
      - type: llm_rubric
        rubric: "Did the agent follow the check → fix → verify workflow?"
        weight: 0.3
```
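The `bash graders/check.sh` entry above can be any executable that prints a JSON score on stdout. A minimal deterministic grader could look like this in TypeScript (the check itself and the exact output schema are assumptions based on the "parses stdout JSON, returns 0.0–1.0" description):

```typescript
import { readFileSync } from "node:fs";

// Hypothetical check: score 1.0 if the fixed file no longer uses the
// legacy `var` keyword, 0.0 otherwise.
function gradeSource(source: string): number {
  return /\bvar\b/.test(source) ? 0.0 : 1.0;
}

// Skillgrade is described as parsing JSON from the grader's stdout.
if (process.argv[2]) {
  const score = gradeSource(readFileSync(process.argv[2], "utf8"));
  console.log(JSON.stringify({ score }));
}
```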
## Key CLI Options
| Flag | Description |
|---|---|
| `--smoke` / `--reliable` / `--regression` | Preset trial counts (5/15/30) |
| `--agent=gemini\|claude\|codex\|acp` | Specify agent |
| `--provider=docker\|local` | Execution environment |
| `--ci` | CI mode, non-zero exit on failure |
| `--threshold=0.8` | CI pass-rate threshold |
| `--eval=NAME[,NAME]` | Run specific evaluations by name |
| `--grader=TYPE` | Run only the specified grader type |
| `--parallel=N` | Concurrent trial count |
| `--validate` | Validate graders against a reference solution |
## Use Cases
- Skill QA: Write repeatable, quantifiable evaluation tests for custom AI agent skills.
- Regression detection: Auto-run evaluations after skill updates to detect functional degradation.
- Multi-agent cross-validation: Test the same skill across Gemini, Claude, Codex, etc.
- CI/CD pipeline integration: Gate skill quality in continuous integration workflows.
- Skill development assistance: `skillgrade init` auto-generates evaluation configs, lowering the barrier to entry.
## Ecosystem
- Related project: skills-best-practices — Skill authoring best practices guide by the same author.
- Inspired by: SkillsBench and the paper "Demystifying Evals for AI Agents".
## Unconfirmed Information
- npm publish status: the README mentions `npm i -g skillgrade`, but GitHub Packages shows "No packages published"; the package may be on the public npmjs.com registry.
- Version/Release: no tags or releases in the repo; only a `main` branch.
- Blog post: the link points to a March 2026 date; its existence and accessibility are not verified.
- The `grader_model` default `gemini-3-flash-preview` may be a preview/experimental model; current availability unconfirmed.
- The specific ACP protocol version and the list of compatible agents are not specified.
- Whether the LLM Rubric Grader supports OpenAI models as graders is not documented.