A benchmark platform featuring 100 PhD-level research tasks across 22 distinct fields, systematically evaluating Deep Research Agents (DRAs) on report generation quality and information retrieval capabilities.
One-Minute Overview
DeepResearch Bench is a benchmark platform designed specifically to evaluate Deep Research Agents (DRAs). It comprises 100 PhD-level research tasks spanning 22 distinct fields, including science, business, and software, and applies two complementary evaluation methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation), which assesses report generation quality, and FACT (Framework for Factual Abundance and Citation Trustworthiness), which assesses information retrieval and citation trustworthiness.
Core Value: Provides objective, systematic evaluation standards to drive innovation and progress in deep research agent technology.
Quick Start
Installation Difficulty: Medium - Requires API key configuration and environment setup
git clone https://github.com/Ayanami0730/deep_research_bench.git
cd deep_research_bench
pip install -r requirements.txt
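Because the evaluation relies on the Gemini API (LLM-based judging) and the Jina API (web scraping), the corresponding keys need to be configured before running. The variable names below are assumptions for illustration; check the repository README for the exact names the scripts expect:

```shell
# Hypothetical variable names -- verify against the repository README.
export GEMINI_API_KEY="your-gemini-key"   # LLM-based evaluation (RACE/FACT judging)
export JINA_API_KEY="your-jina-key"       # web scraping for citation verification
```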
Is this suitable for my scenario?
- ✅ Research Institutions: need systematic evaluation of AI research agent performance
- ✅ AI Developers: developing and optimizing deep research models
- ❌ Beginners: better served by simpler evaluation tools with detailed tutorials
- ❌ Commercial Applications: not a ready-made solution for direct integration into products
Core Capabilities
1. RACE Evaluation System - Comprehensive Report Quality Assessment
- Evaluates research report quality across four dimensions: comprehensiveness, insight/depth, instruction-following, and readability
- Uses dynamic criteria generation and reference-based scoring for accurate and discriminative evaluation
Practical Value: Helps developers understand their model's strengths and weaknesses across critical research dimensions, enabling targeted optimization.
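As a rough illustration of how multi-dimension scoring like this can be combined, here is a minimal Python sketch. The dimension names follow the benchmark's description above, but the weights, function names, and the reference-based normalization shown are assumptions, not the official implementation:

```python
# Illustrative sketch only -- not the official RACE implementation.
# Dimension names follow the benchmark; the weights are hypothetical.

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) using normalized weights."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

def relative_score(target: float, reference: float) -> float:
    """Reference-based normalization: the target report's score
    expressed relative to a high-quality reference report's score."""
    return target / (target + reference) if (target + reference) else 0.0

scores = {"comprehensiveness": 8.0, "depth": 7.0,
          "instruction_following": 9.0, "readability": 8.0}
weights = {"comprehensiveness": 0.3, "depth": 0.3,
           "instruction_following": 0.25, "readability": 0.15}

overall = weighted_score(scores, weights)
print(round(overall, 2))                    # 7.95
print(round(relative_score(overall, 8.5), 3))
```

In the actual benchmark, the per-dimension scores and criteria come from an LLM judge with dynamically generated criteria; this sketch only shows how such scores might be aggregated.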
2. FACT Evaluation Framework - Verifying Information Credibility
- Automatically extracts factual claims and their cited sources from reports
- Verifies whether cited sources actually support the claims, calculating citation accuracy
Practical Value: Ensures model-generated content is grounded in reliable sources, improving the credibility and practical value of research outcomes.
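The final aggregation step of this pipeline can be sketched as follows. `ClaimCitation`, its fields, and the example data are hypothetical; in the real pipeline, claims are extracted automatically and the `supported` verdict would come from an LLM judge checking the scraped source:

```python
# Illustrative sketch only -- hypothetical types and example data.
from dataclasses import dataclass

@dataclass
class ClaimCitation:
    claim: str
    url: str
    supported: bool  # verdict from a judge checking the cited source

def citation_accuracy(pairs: list[ClaimCitation]) -> float:
    """Fraction of claim-citation pairs whose cited source supports the claim."""
    if not pairs:
        return 0.0
    return sum(p.supported for p in pairs) / len(pairs)

pairs = [
    ClaimCitation("GDP grew 3% in 2023", "https://example.com/a", True),
    ClaimCitation("Company X was founded in 1999", "https://example.com/b", True),
    ClaimCitation("The drug reduces risk by 40%", "https://example.com/c", False),
]
print(round(citation_accuracy(pairs), 3))  # 2 of 3 claims supported
```

A related quantity, effective citation count (number of supported claims per report), can be read off the same data as `sum(p.supported for p in pairs)`.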
Technology Stack & Integration
- Development Language: Python
- Key Dependencies: Gemini API (LLM-based evaluation), Jina API (web scraping)
- Integration Method: API / Library
Maintenance Status
- Development Activity: Active, with regular updates to evaluation results and new features
- Recent Updates: Evaluation results added for multiple newly released models
- Community Response: Responsive; partnered with the AGI-Eval platform and maintains a regularly updated leaderboard
Commercial & Licensing
License: MIT
- ✅ Commercial Use: Allowed
- ✅ Modification: Allowed
- ⚠️ Restrictions: Must retain the original copyright and license notice
Documentation & Learning Resources
- Documentation Quality: Comprehensive
- Official Documentation: https://github.com/Ayanami0730/deep_research_bench
- Sample Code: Complete examples and running scripts provided
- Evaluation Examples: Includes detailed evaluation results and comparative analysis