An AI-driven multi-agent research assistant built on LangGraph that automates the entire research workflow, from hypothesis generation through data analysis and visualization to comprehensive report writing.
## Project Overview
DATAGEN (formerly AI-Data-Analysis-MultiAgent) is an enterprise-grade platform for automated data analysis. It leverages multiple specialized AI agents working collaboratively to simulate and execute core tasks of human researchers, achieving end-to-end automation from raw data to insight reports.
## Core Architecture

### Progressive Disclosure Architecture
The system employs an innovative three-level loading strategy to address context overflow in multi-agent long-horizon tasks:
- Level 1 (Metadata): loads only the agent's name, description, and available-skills list (~100 tokens) for routing decisions
- Level 2 (Instructions): loads the complete system prompt (`AGENT.md`) and global rules when an agent is activated
- Level 3 (Resources): loads detailed skill documentation (`SKILL.md`), MCP resources, and external files only during actual execution
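The three levels above can be sketched as a lazy-loading agent handle. This is an illustrative sketch, not DATAGEN's actual API: the class, the in-memory `FILES` stand-in, and the file contents are all hypothetical, but the caching pattern matches the strategy described.

```python
# In-memory stand-in for the config tree, so the sketch runs without files (hypothetical contents).
FILES = {
    "config/agents/code_agent/AGENT.md": "You write and execute analysis code.",
    "config/skills/eda/SKILL.md": "Steps for exploratory data analysis...",
}

class AgentHandle:
    """Progressively discloses an agent's context in three levels."""

    def __init__(self, name, description, skills):
        self.name, self.description, self.skills = name, description, skills
        self._instructions = None   # Level 2 cache, empty until activation
        self._resources = {}        # Level 3 cache, filled per skill on demand

    def metadata(self):
        # Level 1: just enough for the supervisor's routing decision (~100 tokens)
        return {"name": self.name, "description": self.description, "skills": self.skills}

    def activate(self):
        # Level 2: load the full system prompt (AGENT.md) only when the agent is chosen
        if self._instructions is None:
            self._instructions = FILES[f"config/agents/{self.name}/AGENT.md"]
        return self._instructions

    def load_skill(self, skill):
        # Level 3: load detailed skill docs (SKILL.md) only during actual execution
        if skill not in self._resources:
            self._resources[skill] = FILES[f"config/skills/{skill}/SKILL.md"]
        return self._resources[skill]

agent = AgentHandle("code_agent", "Writes and executes data analysis code", ["eda"])
routing_view = agent.metadata()  # Level 1 only; nothing else is loaded yet
```

Until `activate()` or `load_skill()` is called, only the ~100-token metadata occupies context, which is what keeps long-horizon multi-agent runs from overflowing.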
### Agent Specialization System
The system features 9 specialized agents:
| Agent | Responsibility |
|---|---|
| `process_agent` | Supervises and orchestrates the entire research workflow |
| `hypothesis_agent` | Automatically generates and refines research hypotheses |
| `search_agent` / `searcher_agent` | Executes web and literature searches |
| `code_agent` | Writes and executes data analysis code |
| `visualization_agent` | Generates interactive data visualizations |
| `report_agent` | Drafts research reports |
| `quality_review_agent` | Performs quality reviews on analysis processes and results |
| `note_agent` | Handles state tracking and context retention throughout |
| `refiner_agent` | Polishes and optimizes the final report |
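A minimal sketch of how a supervisor in the style of `process_agent` might sequence these agents. The control flow below is illustrative only: DATAGEN's actual routing is a LangGraph state graph, and the stage order, `approved` flag, and retry bound are assumptions.

```python
# Illustrative stage order; the real routing is a LangGraph state graph.
PIPELINE = [
    "hypothesis_agent", "search_agent", "code_agent", "visualization_agent",
    "report_agent", "quality_review_agent", "refiner_agent",
]

def run_workflow(state: dict, agents: dict, max_revisions: int = 2) -> dict:
    """Supervisor loop: run each stage, looping back on a failed quality review."""
    revisions = 0
    i = 0
    while i < len(PIPELINE):
        name = PIPELINE[i]
        state = agents[name](state)  # each agent reads and updates the shared state
        # If the review rejects the result, redo the report (bounded retries).
        if (name == "quality_review_agent"
                and not state.get("approved", True)
                and revisions < max_revisions):
            revisions += 1
            i = PIPELINE.index("report_agent")
            continue
        i += 1
    return state

# Stub agents that just record their turn in the shared state.
def make_stub(name):
    def agent(state):
        state.setdefault("trace", []).append(name)
        if name == "quality_review_agent":
            state["approved"] = True  # approve on the first pass in this demo
        return state
    return agent

agents = {name: make_stub(name) for name in PIPELINE}
final = run_workflow({}, agents)
```

The shared-state dict plays the role that `note_agent` fills in the real system: every agent reads and writes the same evolving context.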
### Multi-Model Support

Different agents can be assigned different underlying LLMs:
- OpenAI: GPT series
- Anthropic: Claude series
- Google: Gemini series
- Groq: High-performance inference
- Ollama: Local model support
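A per-agent assignment might look like the fragment below. This is a hedged sketch of `config/agent_models.yaml`: the key names and model identifiers are illustrative assumptions, not the project's actual schema.

```yaml
# Illustrative sketch; the real schema of config/agent_models.yaml may differ.
process_agent:
  provider: openai
  model: gpt-4o
hypothesis_agent:
  provider: anthropic
  model: claude-3-5-sonnet
code_agent:
  provider: google
  model: gemini-1.5-pro
report_agent:
  provider: ollama
  model: llama3        # served locally via Ollama
```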
## Core Capabilities

### Research Automation
- AI-driven hypothesis generation and validation
- Automated research direction optimization
- Real-time hypothesis refinement
### Data Processing
- Robust data cleaning and transformation
- Scalable analysis pipelines
- Automated quality assurance
### Visualization & Reporting
- Interactive data visualization
- Custom report generation
- Automated insight extraction
### Smart Memory Management
- Note Taker agent for state tracking
- Efficient context retention system
## Quick Start

### Requirements
- Python 3.10+
- Conda (recommended)
- ChromeDriver (for web automation search)
### Installation

```bash
# Clone the repository
git clone https://github.com/starpig1129/DATAGEN.git
cd DATAGEN

# Create the Conda environment
conda create -n datagen python=3.10
conda activate datagen

# Install dependencies
pip install -r requirements.txt
```
### Configuration

- Rename `.env Example` to `.env`
- Configure the required items: `WORKING_DIRECTORY`, `CONDA_ENV`, `CHROMEDRIVER_PATH`
- Configure API keys as needed: `OPENAI_API_KEY`, `GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`, etc.
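A resulting `.env` might look like the fragment below; every value is a placeholder, and the required variable set is taken from the list above rather than from the project's template.

```bash
# Required
WORKING_DIRECTORY=./data_storage
CONDA_ENV=datagen
CHROMEDRIVER_PATH=/usr/local/bin/chromedriver

# API keys (set only the providers you use)
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
ANTHROPIC_API_KEY=...
```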
### Usage Example

```python
user_input = '''
datapath:YourDataName.csv
Use machine learning to perform data analysis and write complete graphical reports
'''
```
## Configuration File Structure

- `config/agent_models.yaml` — Agent model configuration
- `config/agents/{agent_name}/AGENT.md` — System prompts
- `config/agents/{agent_name}/config.yaml` — Tools, skills, and MCP settings
- `config/skills/{skill-name}/SKILL.md` — Reusable skills
- `config/mcp.yaml` — Global MCP server configuration
## Use Cases
- Data Science & Exploratory Data Analysis (EDA)
- Academic Research Assistance (hypothesis validation & literature review)
- Automated Business Analysis Report Generation
- Complex task orchestration with multi-model collaboration
## Important Notes
- Ensure a sufficient API credit balance; the system makes many API calls per run
- A full research workflow can take considerable time, depending on task complexity
- Back up your data before use; the agent system may modify the files it analyzes