An open-source framework for large language model evaluations from the UK AI Security Institute (AISI), featuring a modular Datasets/Solvers/Scorers architecture, multi-model/tool support, sandboxed execution, and 100+ pre-built benchmarks.
## Overview
Inspect is an open-source framework for large language model evaluations developed by the UK AI Security Institute (AISI, formerly the AI Safety Institute). It aims to provide unified, extensible evaluation standards and tooling. The project is MIT-licensed and hosted on the UK Government's official GitHub organization.
## Core Architecture
Modular design centered on `Task`: Dataset (input) -> Solver (processing/reasoning) -> Scorer (evaluation) -> Result
Three Core Components:
- Datasets: Labeled samples with prompts as input and literal values or scoring guides as targets
- Solvers: Chainable execution units (`generate()`, `chain_of_thought()`, `self_critique()`)
- Scorers: Support for exact match, model-graded, and custom scoring
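The Dataset -> Solver -> Scorer pipeline can be sketched in plain Python. This is a conceptual illustration of the data flow, not Inspect's actual API; the names `Sample`, `run_task`, and `exact_match` here are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

# A labeled sample: a prompt as input, a literal value as target.
@dataclass
class Sample:
    input: str
    target: str

# Solvers are chainable units, each transforming the working state.
def chain_of_thought(text: str) -> str:
    return text + "\nThink step by step before answering."

def generate(text: str) -> str:
    # Stand-in for a model call; a real solver would query an LLM here.
    return "4" if "2+2" in text else ""

# Scorer: compares the final model output against the sample's target.
def exact_match(output: str, target: str) -> bool:
    return output.strip() == target.strip()

def run_task(dataset: list[Sample],
             solvers: list[Callable[[str], str]],
             scorer: Callable[[str, str], bool]) -> float:
    correct = 0
    for sample in dataset:
        state = sample.input
        for solver in solvers:   # run the solver chain in order
            state = solver(state)
        correct += scorer(state, sample.target)
    return correct / len(dataset)

accuracy = run_task([Sample("What is 2+2?", "4")],
                    [chain_of_thought, generate],
                    exact_match)
print(accuracy)  # → 1.0
```

In the real framework, solvers operate on a richer task state (messages, tool calls, metadata) rather than a bare string, but the chaining principle is the same.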
## Key Features
### Agent & Tool Support
- Tool Calling: Custom Tools, MCP Tools, Bash, Python, Web Search/Browsing, Computer Tools
- Agent Evaluations: Built-in ReAct Agent, Multi-Agent, external agents (Claude Code, Codex CLI, Gemini CLI)
- Sandboxed Execution: Docker, Kubernetes, Modal, Proxmox
- Tool Approval: Fine-grained tool call approval policies
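A fine-grained approval policy can be thought of as a mapping from tool-name patterns to decisions, checked before each tool call executes. The sketch below is a hypothetical illustration of the idea, not Inspect's approval API; the `POLICY` table and `approve` helper are invented names.

```python
from fnmatch import fnmatch

# Hypothetical policy: first matching pattern wins, default-deny at the end.
POLICY = {
    "bash": "ask",        # shell commands require human sign-off
    "python": "approve",  # sandboxed code execution is pre-approved
    "web_*": "approve",   # covers web_search, web_browse, etc.
    "*": "reject",        # anything unrecognized is rejected
}

def approve(tool_name: str) -> str:
    for pattern, decision in POLICY.items():
        if fnmatch(tool_name, pattern):
            return decision
    return "reject"

print(approve("bash"))        # → ask
print(approve("web_search"))  # → approve
print(approve("rm_rf"))       # → reject
```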
### Model Provider Support
| Type | Providers |
|---|---|
| Commercial APIs | OpenAI, Anthropic, Google, Grok, Mistral, AWS Bedrock, Azure AI, TogetherAI, Groq |
| Local/Open Source | vLLM, Ollama, llama-cpp-python, HuggingFace |
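Models across these providers are addressed with a single string of the form `provider/model-name` (as in `openai/gpt-4o` in the CLI example below). A small sketch of how such a spec splits into its parts; the `parse_model` helper is hypothetical, not part of the framework:

```python
def parse_model(spec: str) -> tuple[str, str]:
    # Split only on the first "/": some model names contain slashes
    # themselves (e.g. a HuggingFace org/repo path).
    provider, _, model = spec.partition("/")
    if not model:
        raise ValueError(f"expected provider/model, got {spec!r}")
    return provider, model

print(parse_model("openai/gpt-4o"))
# → ('openai', 'gpt-4o')
print(parse_model("hf/meta-llama/Llama-3.1-8B"))
# → ('hf', 'meta-llama/Llama-3.1-8B')
```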
### Pre-built Evaluation Library (100+)
- Safeguards: AgentHarm, StrongREJECT, WMDP
- Coding: HumanEval, SWE-bench, BigCodeBench
- Knowledge: MMLU, GPQA, TruthfulQA
- Mathematics: AIME, GSM8K, MATH
- Reasoning: ARC, BBH, DROP
- Assistants: GAIA, OSWorld, Mind2Web
### Developer Tools
- CLI: `inspect eval`, `inspect view`
- Inspect View: Web-based evaluation monitoring and visualization
- VS Code Extension: Evaluation authoring, debugging, and visualization
## Installation & Usage
```shell
pip install inspect-ai
export OPENAI_API_KEY=your-key
inspect eval examples/task.py --model openai/gpt-4o
```
## Technical Specifications
| Attribute | Value |
|---|---|
| Developer | UK AI Security Institute |
| License | MIT License |
| Primary Languages | Python (81%), TypeScript (17.3%) |
| Python Version | >= 3.10 |
| Initial Release | 2024-05 |
## Extension Mechanism
Extensible via Python packages: Elicitation/Scoring techniques, Model APIs, Tool Execution Environments, Storage Platforms
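Extensions are distributed as ordinary Python packages that advertise themselves to the framework via an entry point. A hedged `pyproject.toml` sketch (the package name `inspect-custom-provider` and module name are hypothetical; the exact entry-point group Inspect expects should be checked against its documentation):

```toml
# pyproject.toml of a hypothetical extension package
[project]
name = "inspect-custom-provider"
version = "0.1.0"
dependencies = ["inspect-ai"]

# Registering under an entry-point group lets the framework discover
# the extension at runtime without explicit imports.
[project.entry-points.inspect_ai]
custom = "inspect_custom_provider"
```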