A benchmark platform featuring 100 PhD-level research tasks across 22 distinct fields, systematically evaluating Deep Research Agents (DRAs) on report generation quality and information retrieval capabilities.
One-Minute Overview
DeepResearch Bench is a benchmark platform designed specifically to evaluate Deep Research Agents (DRAs). It comprises 100 PhD-level research tasks spanning 22 distinct fields, including science, business, and software, and applies two complementary evaluation methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation), which assesses report generation quality, and FACT (Framework for Factual Abundance and Citation Trustworthiness), which assesses information retrieval and citation trustworthiness.
Core Value: Provides objective, systematic evaluation standards to drive innovation and progress in deep research agent technology.
Quick Start
Installation Difficulty: Medium - Requires API key configuration and environment setup
git clone https://github.com/Ayanami0730/deep_research_bench.git
cd deep_research_bench
pip install -r requirements.txt
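Because the evaluation relies on the Gemini API (LLM-based judging) and the Jina API (web scraping), the corresponding keys need to be configured before running. The variable names below are assumptions for illustration; check the repository README for the exact names the scripts expect:

```shell
# Hypothetical variable names -- verify against the repository README.
export GEMINI_API_KEY="your-gemini-key"   # LLM-based evaluation (RACE/FACT judging)
export JINA_API_KEY="your-jina-key"       # web scraping for citation verification
```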
Is this suitable for my scenario?
- ✅ Research Institutions: need systematic evaluation of AI research agent performance
- ✅ AI Developers: developing and optimizing deep research models
- ❌ Beginners: better served by simpler evaluation tools with detailed tutorials
- ❌ Commercial Applications: not a ready-made solution for direct integration into products
Core Capabilities
1. RACE Evaluation System - Comprehensive Report Quality Assessment
- Evaluates research report quality across four dimensions: comprehensiveness, insight/depth, instruction-following, and readability
- Uses dynamic criteria generation and reference-based scoring for accurate and discriminative evaluation
Practical Value: Helps developers understand their model's strengths and weaknesses across critical research dimensions, enabling targeted optimization.
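As a rough illustration of how multi-dimension scoring like this can be combined, here is a minimal Python sketch. The dimension names follow the benchmark's description above, but the weights, function names, and the reference-based normalization shown are assumptions, not the official implementation:

```python
# Illustrative sketch only -- not the official RACE implementation.
# Dimension names follow the benchmark; the weights are hypothetical.

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) using normalized weights."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

def relative_score(target: float, reference: float) -> float:
    """Reference-based normalization: the target report's score
    expressed relative to a high-quality reference report's score."""
    return target / (target + reference) if (target + reference) else 0.0

scores = {"comprehensiveness": 8.0, "depth": 7.0,
          "instruction_following": 9.0, "readability": 8.0}
weights = {"comprehensiveness": 0.3, "depth": 0.3,
           "instruction_following": 0.25, "readability": 0.15}

overall = weighted_score(scores, weights)
print(round(overall, 2))                    # 7.95
print(round(relative_score(overall, 8.5), 3))
```

In the actual benchmark, the per-dimension scores and criteria come from an LLM judge with dynamically generated criteria; this sketch only shows how such scores might be aggregated.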
2. FACT Evaluation Framework - Verifying Information Credibility
- Automatically extracts factual claims and their cited sources from reports
- Verifies whether cited sources actually support the claims, calculating citation accuracy
Practical Value: Ensures model-generated content is grounded in reliable sources, improving the credibility and practical value of research outcomes.
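The final aggregation step of this pipeline can be sketched as follows. `ClaimCitation`, its fields, and the example data are hypothetical; in the real pipeline, claims are extracted automatically and the `supported` verdict would come from an LLM judge checking the scraped source:

```python
# Illustrative sketch only -- hypothetical types and example data.
from dataclasses import dataclass

@dataclass
class ClaimCitation:
    claim: str
    url: str
    supported: bool  # verdict from a judge checking the cited source

def citation_accuracy(pairs: list[ClaimCitation]) -> float:
    """Fraction of claim-citation pairs whose cited source supports the claim."""
    if not pairs:
        return 0.0
    return sum(p.supported for p in pairs) / len(pairs)

pairs = [
    ClaimCitation("GDP grew 3% in 2023", "https://example.com/a", True),
    ClaimCitation("Company X was founded in 1999", "https://example.com/b", True),
    ClaimCitation("The drug reduces risk by 40%", "https://example.com/c", False),
]
print(round(citation_accuracy(pairs), 3))  # 2 of 3 claims supported
```

A related quantity, effective citation count (number of supported claims per report), can be read off the same data as `sum(p.supported for p in pairs)`.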
Technology Stack & Integration
- Development Language: Python
- Key Dependencies: Gemini API (LLM-based evaluation), Jina API (web scraping)
- Integration Method: API / Library
Maintenance Status
- Development Activity: Active, with regular updates to evaluation results and new features
- Recent Updates: Evaluation results added for multiple newly released models
- Community Response: Responsive; partnered with the AGI-Eval platform and maintains a regularly updated leaderboard
Commercial & Licensing
License: MIT
- ✅ Commercial Use: Allowed
- ✅ Modification: Allowed
- ⚠️ Restrictions: Must retain the original copyright and license notice
Documentation & Learning Resources
- Documentation Quality: Comprehensive
- Official Documentation: https://github.com/Ayanami0730/deep_research_bench
- Sample Code: Complete examples and running scripts provided
- Evaluation Examples: Includes detailed evaluation results and comparative analysis