An all-in-one data preparation system for LLMs, supporting reproducible operator pipelines for data generation, cleaning, evaluation, and filtering.
## Positioning
DataFlow is an open-source data preparation and training system for LLMs developed by OpenDCAI (Open Data Center AI Team). It is designed to distill high-quality training data from noisy sources such as PDFs, plain text, and low-quality QA pairs, improving LLM performance in vertical domains like healthcare, finance, law, and academic research.
## Core Architecture
The system adopts a PyTorch-style Pipeline → Operator → Prompt hierarchical architecture:
- Pipeline: Orchestrates execution order of multiple operators and manages data flow.
- Operator: Encapsulates specific data processing tasks with a consistent API.
- Prompt: Underlying prompt templates defining LLM interaction patterns.
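The three-level hierarchy can be sketched as follows. This is a minimal, self-contained illustration of the pattern; the class names and method signatures are assumptions for exposition, not DataFlow's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Prompt:
    """A reusable template defining how an operator talks to an LLM."""
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

class Operator:
    """Encapsulates one data-processing step behind a consistent run() API."""
    def __init__(self, name: str, fn: Callable[[list], list]):
        self.name = name
        self.fn = fn

    def run(self, records: list) -> list:
        return self.fn(records)

class Pipeline:
    """Orchestrates operators in order, passing records between them."""
    def __init__(self, operators: List[Operator]):
        self.operators = operators

    def run(self, records: list) -> list:
        for op in self.operators:
            records = op.run(records)
        return records

# Example: a cleaning step followed by a filtering step.
clean = Operator("clean", lambda recs: [r.strip() for r in recs])
drop_empty = Operator("drop_empty", lambda recs: [r for r in recs if r])
pipeline = Pipeline([clean, drop_empty])
print(pipeline.run(["  hello ", "   ", "world"]))  # → ['hello', 'world']
```

The PyTorch analogy is that operators compose like layers: each exposes the same `run()` contract, so a pipeline can chain them without knowing their internals.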
## Capability Matrix
### Pipeline Orchestration & Operator System
- Ships with 10+ core operators that define standard interaction patterns, plus 100+ pipeline-specific operators covering generation, evaluation, filtering, and refinement.
- Supports fully custom plug-and-play operators, distributable via GitHub or PyPI.
- Encapsulates data-governance algorithms as operator pipelines so strategies can be compared fairly; the underlying LLM can be swapped out to analyze how data quality affects model performance.
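Plug-and-play operators typically work through some form of registration so pipelines can look them up by name. The registry and decorator below are a hypothetical sketch of that idea, not DataFlow's real registration mechanism.

```python
# Hypothetical operator registry; DataFlow's actual mechanism may differ.
OPERATOR_REGISTRY = {}

def register_operator(name):
    """Decorator that makes an operator class discoverable by name."""
    def wrap(cls):
        OPERATOR_REGISTRY[name] = cls
        return cls
    return wrap

@register_operator("dedup")
class DedupOperator:
    """Drops duplicate records while preserving first-seen order."""
    def run(self, records):
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

op = OPERATOR_REGISTRY["dedup"]()
print(op.run(["a", "b", "a"]))  # → ['a', 'b']
```

A registry like this is what makes third-party operators distributed via GitHub or PyPI composable: installing a package only needs to populate the registry as a side effect of import.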
### Data Synthesis & Cleaning Workflows
- Multi-type data generation: text, math, and code data generation (validated via the DataFlow-Instruct-10K dataset).
- Tool-driven generation: integrates AgenticRAG, Text2SQL, and other tools (the Text2SQL workflow was accepted at ICDE 2026, and the math data workflow at KDD 2026).
- Structured document extraction: large-scale PDF → QA conversion and book PDF → visual QA conversion.
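Conceptually, the PDF → QA conversion splits extracted text into passages and turns each into an LLM request for QA generation. The toy functions below illustrate only that shaping step; the names, chunking rule, and prompt wording are assumptions, and real DataFlow operators additionally handle PDF parsing and the LLM calls themselves.

```python
# Toy illustration of turning extracted document text into QA-generation
# records; chunk size and prompt text are illustrative assumptions.
def chunk_text(text: str, max_words: int = 50) -> list:
    """Split raw text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def to_qa_requests(chunks: list) -> list:
    """Wrap each passage in a QA-generation prompt record."""
    prompt = "Generate a question-answer pair grounded in this passage:\n{passage}"
    return [{"prompt": prompt.format(passage=c)} for c in chunks]

chunks = chunk_text("word " * 120)
print(len(chunks))  # 120 words at 50 per chunk → 3 chunks
```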
## DataFlow Suite Components
| Component | Function |
|---|---|
| DataFlow-WebUI | Visual drag-and-drop pipeline building & management (Vue.js frontend + FastAPI backend) |
| DataFlow-Agent | AI-driven assistant for auto-composing and optimizing operators via natural language |
| DataFlow-Ecosystem | Modular distribution layer with standardized operator registration, supporting domain extensions (e.g., DataFlow-MM, DataFlow-AI4S) |
| RayOrch | Ray-based high-performance distributed computing orchestration layer |
## Typical Use Cases
- LLM pre-training data preparation: Extract and filter high-quality pre-training corpus from raw text.
- SFT data synthesis: Automatically generate high-quality instruction-response pairs.
- RL training data preparation: Provide high-quality data for reinforcement learning.
- RAG system data construction: Extract structured knowledge from PDFs/documents.
- Domain-specific data preparation: Healthcare, finance, law, academic research.
- Math/code data augmentation: Specialized pipelines for math reasoning and code generation.
- Text2SQL data augmentation: SQL-aware data augmentation framework (reported +3% execution accuracy).
- Enterprise data governance: traceable, manageable data-governance workflows built on the Git ecosystem.
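For the SFT use case, data preparation often reduces to rule-based quality filtering over instruction-response pairs. The filter below is a deliberately simple sketch in that spirit; the field names and thresholds are assumptions, not DataFlow defaults.

```python
# Illustrative QA-pair quality filter; fields and thresholds are assumptions.
def filter_qa_pairs(pairs, min_answer_words=3):
    """Keep pairs with a well-formed question and a non-trivial answer."""
    kept = []
    for p in pairs:
        q, a = p.get("question", ""), p.get("answer", "")
        if q.endswith("?") and len(a.split()) >= min_answer_words:
            kept.append(p)
    return kept

pairs = [
    {"question": "What is DataFlow?", "answer": "A data preparation system for LLMs."},
    {"question": "bad", "answer": "no"},
]
print(len(filter_qa_pairs(pairs)))  # → 1
```

In practice such heuristics are one operator among several; model-based scoring operators would typically follow in the same pipeline.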
## Installation
pip (recommended):

```shell
pip install uv
uv pip install open-dataflow
```

For local GPU inference (vLLM):

```shell
uv pip install "open-dataflow[vllm]"
```

Verify the install:

```shell
dataflow -v
```

Docker:

```shell
docker pull molyheci/dataflow:cu124
docker run --gpus all -it molyheci/dataflow:cu124
```

WebUI:

```shell
dataflow webui   # opens http://localhost:8000/
```
## Key Configuration
- LLM backend: configure any OpenAI-compatible API via `api_url`; local vLLM inference is also supported.
- Key management: API keys are injected via the `DF_API_KEY` environment variable.
- Data formats: native support for JSON, JSONL, and CSV input/output.
- Platforms: Windows, Linux, macOS (Python 3.10/3.11/3.12).
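Wiring these two settings together might look like the sketch below. `api_url` and `DF_API_KEY` come from the configuration notes above; the `load_llm_config` helper itself is hypothetical, and the URL points at a local vLLM-style server only as an example.

```python
import os

def load_llm_config(api_url: str) -> dict:
    """Read the API key from the environment and bundle it with the endpoint."""
    key = os.environ.get("DF_API_KEY")
    if not key:
        raise RuntimeError("Set the DF_API_KEY environment variable first.")
    return {"api_url": api_url, "api_key": key}

os.environ["DF_API_KEY"] = "sk-demo"  # normally exported in the shell instead
cfg = load_llm_config("http://localhost:8000/v1")  # e.g. a local vLLM server
print(cfg["api_url"])  # → http://localhost:8000/v1
```

Reading the key from the environment rather than a config file keeps secrets out of version-controlled pipeline definitions.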
## Unconfirmed Information
- Domain extension modules (DataFlow-MM, DataFlow-AI4S) lack specific repository links.
- DataFlow-Agent and RayOrch independent repos/docs links are unconfirmed.
- DataFlow-Instruct-10K dataset lacks download or HuggingFace hosting link.
- The release date noted as 2025-06-28 may conflict with the referenced conference timelines; pending verification.