
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Added Jan 26, 2026
Category: Docs, Tutorials & Resources · Open Source
Tags: Python · Large Language Models · Deep Learning · AI Agents · Web Application · Docs, Tutorials & Resources · Knowledge Management, Retrieval & RAG · Education & Research Resources · Model Training & Inference

A benchmark platform featuring 100 PhD-level research tasks across 22 distinct fields, systematically evaluating Deep Research Agents (DRAs) on report generation quality and information retrieval capabilities.

One-Minute Overview#

DeepResearch Bench is a comprehensive benchmark platform designed specifically to evaluate Deep Research Agents (DRAs). It features 100 PhD-level research tasks across 22 distinct fields, including science, business, and software, and applies two complementary evaluation methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation) and FACT (Framework for Factual Abundance and Citation Trustworthiness), which together assess both the quality and the factual grounding of agent-generated reports.

Core Value: Provides objective, systematic evaluation standards to drive innovation and progress in deep research agent technology.

Quick Start#

Installation Difficulty: Medium (requires API key configuration and environment setup)

git clone https://github.com/Ayanami0730/deep_research_bench.git
cd deep_research_bench
pip install -r requirements.txt
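
Before running the suite, the two external services need credentials. A minimal setup sketch, assuming the conventional variable names GEMINI_API_KEY and JINA_API_KEY; check the repository README for the exact names its scripts read:

export GEMINI_API_KEY="your-gemini-key"   # LLM-as-judge evaluation
export JINA_API_KEY="your-jina-key"       # web scraping for citation checks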

Is this suitable for my scenario?

  • ✅ Research Institutions: need systematic evaluation of AI research agent performance
  • ✅ AI Developers: developing and optimizing deep research models
  • ⚠️ Beginners: may be better served by simpler evaluation tools with more detailed tutorials
  • ⚠️ Commercial Applications: no turnkey product integration is provided; expect to build on the library yourself

Core Capabilities#

1. RACE Evaluation System - Comprehensive Report Quality Assessment#

  • Evaluates research report quality across four dimensions: comprehensiveness, insight/depth, instruction-following, and readability
  • Uses dynamic criteria generation and reference-based scoring for accurate and discriminative evaluation

Practical Value: Helps developers understand a model's strengths and weaknesses across critical research dimensions, enabling targeted optimization.
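
To make the scoring mechanics concrete, here is a minimal sketch of reference-based, criteria-weighted aggregation in the spirit of RACE. The dimension weights and the normalization step are illustrative assumptions, not the repository's actual formula:

# Sketch only: weights and normalization are illustrative, not
# DeepResearch Bench's actual formula.
DIMENSIONS = ["comprehensiveness", "insight", "instruction_following", "readability"]

def race_style_score(target, reference, weights):
    """Aggregate per-dimension judge scores, normalizing each dimension
    against a reference report so scores stay discriminative."""
    total = 0.0
    for dim in DIMENSIONS:
        relative = target[dim] / (target[dim] + reference[dim])  # 0.5 == parity with reference
        total += weights[dim] * relative
    return total

target    = {"comprehensiveness": 7.0, "insight": 8.5, "instruction_following": 8.0, "readability": 7.5}
reference = {"comprehensiveness": 8.0, "insight": 8.0, "instruction_following": 8.0, "readability": 8.0}
weights   = {dim: 0.25 for dim in DIMENSIONS}
print(f"RACE-style score: {race_style_score(target, reference, weights):.3f}")
# -> 0.491 (slightly below parity with the reference report)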

2. FACT Evaluation Framework - Verifying Information Credibility#

  • Automatically extracts factual claims and their cited sources from reports
  • Verifies whether cited sources actually support the claims, calculating citation accuracy

Practical Value: Ensures model-generated content is grounded in reliable sources, improving the credibility and usefulness of research outcomes.
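
The accounting behind those numbers is simple once the verdicts exist. A minimal sketch, with claim extraction and support verification stubbed out as pre-computed verdicts (in the real pipeline an LLM produces both):

from dataclasses import dataclass

@dataclass
class CitedClaim:
    statement: str
    source_url: str
    supported: bool  # verdict from the support-verification step

def citation_metrics(claims):
    """Return (citation accuracy, count of supported citations)."""
    if not claims:
        return 0.0, 0
    supported = sum(c.supported for c in claims)
    return supported / len(claims), supported

claims = [
    CitedClaim("GDP grew 5.2% in 2023", "https://example.org/report", True),
    CitedClaim("The model has 70B parameters", "https://example.org/card", True),
    CitedClaim("Adoption doubled year over year", "https://example.org/blog", False),
]
accuracy, effective = citation_metrics(claims)
print(f"Citation accuracy: {accuracy:.0%}, supported citations: {effective}")
# -> Citation accuracy: 67%, supported citations: 2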

Technology Stack & Integration#

  • Development Language: Python
  • Key Dependencies: Gemini API (LLM-as-judge evaluation), Jina API (web scraping)
  • Integration Method: API / Library
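
A sketch of how the two dependencies fit together: Jina Reader fetches a cited page as clean text, and a Gemini model judges whether the page supports a claim. The model name, prompt, and environment variable names are illustrative assumptions; the repository's own scripts may wire these differently:

import os
import requests
import google.generativeai as genai

def fetch_page_text(url):
    """Scrape a page as LLM-ready text via the Jina Reader endpoint."""
    headers = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}
    resp = requests.get(f"https://r.jina.ai/{url}", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text

def supports_claim(claim, page_text):
    """Ask a Gemini model for a SUPPORTED / NOT_SUPPORTED verdict."""
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model choice
    prompt = (
        "Does the source support the claim? Answer SUPPORTED or NOT_SUPPORTED.\n"
        f"Claim: {claim}\nSource:\n{page_text[:8000]}"  # truncate to fit the context window
    )
    return model.generate_content(prompt).text.strip()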

Maintenance Status#

  • Development Activity: Active; frequent updates add evaluation results for new models
  • Community Response: Engaged; partnered with the AGI-Eval platform and regularly updates the leaderboard

Commercial & Licensing#

License: MIT

  • ✅ Commercial Use: Allowed
  • ✅ Modification: Allowed
  • ⚠️ Restrictions: Must retain the original copyright and license notice

Documentation & Learning Resources#

  • Documentation Quality: Comprehensive
  • Official Documentation: https://github.com/Ayanami0730/deep_research_bench
  • Sample Code: Complete examples and running scripts provided
  • Evaluation Examples: Includes detailed evaluation results and comparative analysis
