An all-in-one data preparation system for LLMs, supporting reproducible operator pipelines for data generation, cleaning, evaluation, and filtering.
## Positioning
DataFlow is an open-source data preparation and training system for LLMs developed by OpenDCAI (Open Data Center AI Team). It is designed to distill high-quality training data from noisy sources such as PDFs, plain text, and low-quality QA pairs, improving LLM performance in vertical domains like healthcare, finance, law, and academic research.
## Core Architecture
The system adopts a PyTorch-style Pipeline → Operator → Prompt hierarchical architecture:
- Pipeline: Orchestrates execution order of multiple operators and manages data flow.
- Operator: Encapsulates specific data processing tasks with a consistent API.
- Prompt: Underlying prompt templates defining LLM interaction patterns.
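The three-level hierarchy can be sketched as follows. This is a minimal, self-contained illustration of the pattern; the class names and method signatures are assumptions for exposition, not DataFlow's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Prompt:
    """A reusable template defining how an operator talks to an LLM."""
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

class Operator:
    """Encapsulates one data-processing step behind a consistent run() API."""
    def __init__(self, name: str, fn: Callable[[list], list]):
        self.name = name
        self.fn = fn

    def run(self, records: list) -> list:
        return self.fn(records)

class Pipeline:
    """Orchestrates operators in order, passing records between them."""
    def __init__(self, operators: List[Operator]):
        self.operators = operators

    def run(self, records: list) -> list:
        for op in self.operators:
            records = op.run(records)
        return records

# Example: a cleaning step followed by a filtering step.
clean = Operator("clean", lambda recs: [r.strip() for r in recs])
drop_empty = Operator("drop_empty", lambda recs: [r for r in recs if r])
pipeline = Pipeline([clean, drop_empty])
print(pipeline.run(["  hello ", "   ", "world"]))  # → ['hello', 'world']
```

The PyTorch analogy is that operators compose like layers: each exposes the same `run()` contract, so a pipeline can chain them without knowing their internals.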
## Capability Matrix
### Pipeline Orchestration & Operator System
- Ships with 10+ core operators that define standard interaction patterns, plus 100+ pipeline-specific operators covering generation, evaluation, filtering, and refinement.
- Supports fully custom plug-and-play operators, distributable via GitHub or PyPI.
- Encapsulates data-governance algorithms as operator pipelines so strategies can be compared fairly; the underlying LLM can be swapped out to analyze how data quality affects model performance.
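Plug-and-play operators typically work through some form of registration so pipelines can look them up by name. The registry and decorator below are a hypothetical sketch of that idea, not DataFlow's real registration mechanism.

```python
# Hypothetical operator registry; DataFlow's actual mechanism may differ.
OPERATOR_REGISTRY = {}

def register_operator(name):
    """Decorator that makes an operator class discoverable by name."""
    def wrap(cls):
        OPERATOR_REGISTRY[name] = cls
        return cls
    return wrap

@register_operator("dedup")
class DedupOperator:
    """Drops duplicate records while preserving first-seen order."""
    def run(self, records):
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

op = OPERATOR_REGISTRY["dedup"]()
print(op.run(["a", "b", "a"]))  # → ['a', 'b']
```

A registry like this is what makes third-party operators distributed via GitHub or PyPI composable: installing a package only needs to populate the registry as a side effect of import.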
### Data Synthesis & Cleaning Workflows
- Multi-type data generation: text, math, and code data generation (validated via the DataFlow-Instruct-10K dataset).
- Tool-driven generation: integrates AgenticRAG, Text2SQL, and other tools (the Text2SQL workflow was accepted at ICDE 2026, and the math data workflow at KDD 2026).
- Structured document extraction: large-scale PDF → QA conversion and book PDF → visual QA conversion.
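Conceptually, the PDF → QA conversion splits extracted text into passages and turns each into an LLM request for QA generation. The toy functions below illustrate only that shaping step; the names, chunking rule, and prompt wording are assumptions, and real DataFlow operators additionally handle PDF parsing and the LLM calls themselves.

```python
# Toy illustration of turning extracted document text into QA-generation
# records; chunk size and prompt text are illustrative assumptions.
def chunk_text(text: str, max_words: int = 50) -> list:
    """Split raw text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def to_qa_requests(chunks: list) -> list:
    """Wrap each passage in a QA-generation prompt record."""
    prompt = "Generate a question-answer pair grounded in this passage:\n{passage}"
    return [{"prompt": prompt.format(passage=c)} for c in chunks]

chunks = chunk_text("word " * 120)
print(len(chunks))  # 120 words at 50 per chunk → 3 chunks
```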
## DataFlow Suite Components
| Component | Function |
|---|---|
| DataFlow-WebUI | Visual drag-and-drop pipeline building & management (Vue.js frontend + FastAPI backend) |
| DataFlow-Agent | AI-driven assistant for auto-composing and optimizing operators via natural language |
| DataFlow-Ecosystem | Modular distribution layer with standardized operator registration, supporting domain extensions (e.g., DataFlow-MM, DataFlow-AI4S) |
| RayOrch | Ray-based high-performance distributed computing orchestration layer |
## Typical Use Cases
- LLM pre-training data preparation: Extract and filter high-quality pre-training corpus from raw text.
- SFT data synthesis: Automatically generate high-quality instruction-response pairs.
- RL training data preparation: Provide high-quality data for reinforcement learning.
- RAG system data construction: Extract structured knowledge from PDFs/documents.
- Domain-specific data preparation: Healthcare, finance, law, academic research.
- Math/code data augmentation: Specialized pipelines for math reasoning and code generation.
- Text2SQL data augmentation: SQL-aware data augmentation framework (reported +3% execution accuracy).
- Enterprise data governance: traceable, manageable data-governance workflows built on the Git ecosystem.
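For the SFT use case, data preparation often reduces to rule-based quality filtering over instruction-response pairs. The filter below is a deliberately simple sketch in that spirit; the field names and thresholds are assumptions, not DataFlow defaults.

```python
# Illustrative QA-pair quality filter; fields and thresholds are assumptions.
def filter_qa_pairs(pairs, min_answer_words=3):
    """Keep pairs with a well-formed question and a non-trivial answer."""
    kept = []
    for p in pairs:
        q, a = p.get("question", ""), p.get("answer", "")
        if q.endswith("?") and len(a.split()) >= min_answer_words:
            kept.append(p)
    return kept

pairs = [
    {"question": "What is DataFlow?", "answer": "A data preparation system for LLMs."},
    {"question": "bad", "answer": "no"},
]
print(len(filter_qa_pairs(pairs)))  # → 1
```

In practice such heuristics are one operator among several; model-based scoring operators would typically follow in the same pipeline.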
## Installation
pip (recommended):

```shell
pip install uv
uv pip install open-dataflow
```

For local GPU inference (vLLM):

```shell
uv pip install "open-dataflow[vllm]"
```

Verify the install:

```shell
dataflow -v
```

Docker:

```shell
docker pull molyheci/dataflow:cu124
docker run --gpus all -it molyheci/dataflow:cu124
```

WebUI:

```shell
dataflow webui   # opens http://localhost:8000/
```
## Key Configuration
- LLM backend: configure any OpenAI-compatible API via `api_url`; local vLLM inference is also supported.
- Key management: API keys are injected via the `DF_API_KEY` environment variable.
- Data formats: native support for JSON, JSONL, and CSV input/output.
- Platforms: Windows, Linux, macOS (Python 3.10/3.11/3.12).
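Wiring these two settings together might look like the sketch below. `api_url` and `DF_API_KEY` come from the configuration notes above; the `load_llm_config` helper itself is hypothetical, and the URL points at a local vLLM-style server only as an example.

```python
import os

def load_llm_config(api_url: str) -> dict:
    """Read the API key from the environment and bundle it with the endpoint."""
    key = os.environ.get("DF_API_KEY")
    if not key:
        raise RuntimeError("Set the DF_API_KEY environment variable first.")
    return {"api_url": api_url, "api_key": key}

os.environ["DF_API_KEY"] = "sk-demo"  # normally exported in the shell instead
cfg = load_llm_config("http://localhost:8000/v1")  # e.g. a local vLLM server
print(cfg["api_url"])  # → http://localhost:8000/v1
```

Reading the key from the environment rather than a config file keeps secrets out of version-controlled pipeline definitions.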
## Unconfirmed Information
- Domain extension modules (DataFlow-MM, DataFlow-AI4S) lack specific repository links.
- DataFlow-Agent and RayOrch independent repos/docs links are unconfirmed.
- DataFlow-Instruct-10K dataset lacks download or HuggingFace hosting link.
- The release date noted as 2025-06-28 may conflict with the referenced conference timelines; pending verification.