An AI-powered team of data science agents that automates data loading, cleaning, feature engineering, EDA, visualization, and machine learning modeling (H2O + MLflow) through specialized agent collaboration. A Streamlit visual pipeline studio lets you perform common data science tasks 10X faster.
Overview#
AI Data Science Team is a Python library for building a virtual AI data science team. It uses Large Language Models (LLMs) to drive multiple specialized agents, automating the entire workflow from data loading, cleaning, wrangling, and EDA through to machine learning modeling.
Core Capabilities#
Data Processing#
- Data Loading & Inspection: Support for common formats (CSV, Excel, etc.)
- Data Cleaning: Automatically handle missing values, outliers, duplicates
- Data Wrangling: Format conversion, pivot tables, merging, etc.
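As an illustrative sketch (not the library's internal code), the cleaning steps above are the kind of pandas code a data cleaning agent typically generates: filling missing values, dropping duplicates, and clipping outliers.

```python
import pandas as pd
import numpy as np

# Toy data with a missing value, a duplicate row, and an outlier
df = pd.DataFrame({
    "age": [25, np.nan, 40, 40, 120],
    "city": ["NY", "LA", "SF", "SF", "NY"],
})

# 1. Fill missing numeric values with the median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Drop exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# 3. Clip outliers to the 1st-99th percentile range
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(low, high)

print(df.shape)
```

The real agent chooses strategies per column (e.g. median vs. mode imputation) based on the data it inspects; this fixed sequence only shows the shape of the generated code.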
Feature & Analysis#
- Feature Engineering: Automatic feature generation/selection
- EDA (Exploratory Data Analysis): Automatic statistical summaries and charts
- Visualization: Code-generated charting capabilities
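A minimal sketch of what the feature engineering and EDA steps amount to in generated pandas code (this is illustrative, not the agents' actual output): one-hot encoding a categorical column and producing a statistical summary.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "price": [10.0, 12.5, 11.0],
})

# Feature engineering: one-hot encode a categorical column
features = pd.get_dummies(df, columns=["color"], prefix="color")

# EDA: statistical summary of a numeric column
summary = df["price"].describe()

print(features.columns.tolist())
print(summary["mean"])
```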
Data Source Interaction#
- SQL Interaction: Natural language to SQL queries, database interaction
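The SQL agent's job splits into two halves: the LLM translates a natural-language question into SQL, and the generated query is executed against the database. Here is a sketch of the execution half using the stdlib `sqlite3` module; the question and the query it maps to are hypothetical examples, not output from the library.

```python
import sqlite3

# In-memory database with a small sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("West", 250.0), ("East", 50.0)])

# An LLM might turn "total sales per region" into SQL like this:
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
rows = conn.execute(query).fetchall()
print(rows)  # [('East', 150.0), ('West', 250.0)]
```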
Modeling Capabilities#
- H2O AutoML: Integrated H2O for automated modeling
- MLflow Integration: Experiment tracking and model management
- Model Evaluation: Automated evaluation metrics generation
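To ground the "automated evaluation metrics" capability, here is a sketch of the kind of metrics a model evaluation agent reports, computed by hand for a binary classifier in pure Python (no H2O or MLflow required, and not the agent's actual code):

```python
# Toy predictions from a binary classifier
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
```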
Agent Module System#
Base Agents (agents/)#
- data_loader_tools_agent: Data loading
- data_cleaning_agent: Data cleaning
- data_wrangling_agent: Data wrangling
- data_visualization_agent: Visualization
- feature_engineering_agent: Feature engineering
- sql_database_agent: SQL database operations
- workflow_planner_agent: Workflow planning
Data Science Agents (ds_agents/)#
eda_tools_agent: Focused on EDA toolchain
Machine Learning Agents (ml_agents/)#
- h2o_ml_agent: Executes H2O machine learning tasks
- mlflow_tools_agent: Manages MLflow tools
- model_evaluation_agent: Focused on model evaluation
Multi-Agent System (multiagents/)#
- pandas_data_analyst: Pandas data analysis expert
- sql_data_analyst: SQL data analysis expert
- supervisor_ds_team: Supervisor agent coordinating the other agents
Flagship Application: AI Pipeline Studio#
An interactive application that uses Streamlit as its graphical frontend:
- Pipeline-first Workspace: Integrated visual editor, table viewer, chart generator, and code viewer
- Hybrid Mode: Support for manual and AI automated steps
- Project Management: Save projects (metadata-only or full-data), with rehydration support (reloading from the source data)
- Context Memory: Short-term memory for multi-turn conversation context
- Debugging: Verbose log mode with output to the logs/ directory
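The context memory feature above can be sketched as a bounded buffer of conversation turns. This is a hypothetical illustration of the idea of short-term memory, not the studio's implementation:

```python
from collections import deque

class ShortTermMemory:
    """Keep only the last N conversation turns so prompt context stays bounded."""

    def __init__(self, max_turns=5):
        self.turns = deque(maxlen=max_turns)

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})

    def context(self):
        # Oldest turns beyond max_turns have already been evicted
        return list(self.turns)

mem = ShortTermMemory(max_turns=3)
for i in range(5):
    mem.add("user", f"message {i}")

print([t["content"] for t in mem.context()])  # ['message 2', 'message 3', 'message 4']
```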
Architecture#
ai_data_science_team/
├── orchestration.py # Orchestration logic (core flow control)
├── agents/ # Base data science agents
├── ds_agents/ # Extended DS agents
├── ml_agents/ # Extended ML agents
├── multiagents/ # Multi-agent collaboration logic
├── parsers/ # Output parsers
├── templates/ # Prompt templates
├── tools/ # Low-level tool functions
└── utils/ # General utility functions
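To make the role of `orchestration.py` concrete, here is a hypothetical sketch of core flow control (not the module's actual code): each "agent" is modeled as a function that transforms shared pipeline state, and the orchestrator runs them in sequence.

```python
# Each step takes the shared state dict and returns it transformed
def load_step(state):
    state["data"] = [1, None, 3]
    return state

def clean_step(state):
    state["data"] = [x for x in state["data"] if x is not None]
    return state

def model_step(state):
    state["prediction"] = sum(state["data"]) / len(state["data"])
    return state

def run_pipeline(steps, state=None):
    """Run agent steps in order, threading state through each one."""
    state = state or {}
    for step in steps:
        state = step(state)
    return state

result = run_pipeline([load_step, clean_step, model_step])
print(result["prediction"])  # 2.0
```

The real orchestration layer also handles LLM calls, retries, and inter-agent routing; this sketch only shows the basic state-threading pattern.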
Installation & Quick Start#
Requirements#
- Python 3.10+
- OpenAI API Key (recommended) or a locally running Ollama instance
Installation#
# PyPI installation
pip install ai-data-science-team
# Source development installation
git clone https://github.com/business-science/ai-data-science-team.git
cd ai-data-science-team
pip install -e .
Run AI Pipeline Studio#
streamlit run apps/ai-pipeline-studio-app/app.py
LLM Configuration#
OpenAI (Cloud)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4.1-mini")
Ollama (Local)
# Start the Ollama server and pull a model first
ollama serve
ollama pull llama3.1:8b
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.1:8b")
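Since both backends expose the same LangChain chat interface, a small helper can pick the configuration from an environment variable. This helper is hypothetical (not part of the library) and returns plain settings rather than importing langchain, so the sketch stays self-contained:

```python
import os

def llm_settings():
    """Choose LLM settings from the LLM_BACKEND environment variable.

    Hypothetical helper: map the backend name to the chat class and
    model shown in the configuration examples above.
    """
    if os.environ.get("LLM_BACKEND", "openai") == "ollama":
        return {"class": "ChatOllama", "model": "llama3.1:8b"}
    return {"class": "ChatOpenAI", "model": "gpt-4.1-mini"}

os.environ["LLM_BACKEND"] = "ollama"
print(llm_settings())  # {'class': 'ChatOllama', 'model': 'llama3.1:8b'}
```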
Example Resources#
Rich Jupyter Notebook examples are provided:
- data_cleaning_agent.ipynb
- data_loader_tools_agent.ipynb
- data_visualization_agent.ipynb
- data_wrangling_agent.ipynb
- feature_engineering_agent.ipynb
- sql_database_agent.ipynb
Advanced topic directories are also included: advanced_topics/, ds_agents/, ml_agents/, multiagents/, teams_of_agents/