DISCOVER THE FUTURE OF AI AGENTSarrow_forward

DataFlow

calendar_todayAdded Apr 22, 2026
categoryModel & Inference Framework
codeOpen Source
PythonWorkflow AutomationDocker大语言模型RAGCLINatural Language ProcessingModel & Inference FrameworkOtherAutomation, Workflow & RPAKnowledge Management, Retrieval & RAGModel Training & Inference

An all-in-one data preparation system for LLMs, supporting reproducible operator pipelines for data generation, cleaning, evaluation, and filtering.

Positioning#

DataFlow is an open-source data preparation and training system for LLMs developed by OpenDCAI (Open Data Center AI Team), designed to distill high-quality training data from noisy sources like PDFs, plain text, and low-quality QA pairs to improve LLM performance in vertical domains such as healthcare, finance, law, and academic research.

Core Architecture#

The system adopts a PyTorch-style Pipeline → Operator → Prompt hierarchical architecture:

  • Pipeline: Orchestrates execution order of multiple operators and manages data flow.
  • Operator: Encapsulates specific data processing tasks with a consistent API.
  • Prompt: Underlying prompt templates defining LLM interaction patterns.

Capability Matrix#

Pipeline Orchestration & Operator System#

  • Built-in 10+ core operators defining interaction patterns and 100+ pipeline-specific operators covering generation, evaluation, filtering, and refinement.
  • Supports fully custom plug-and-play operators distributable via GitHub or PyPI.
  • Data governance algorithms encapsulated as operator pipelines for fair strategy comparison; easily swap underlying LLMs to analyze data quality vs. model performance.

Data Synthesis & Cleaning Workflows#

  • Multi-type data generation: Text, math, and code data generation (validated via DataFlow-Instruct-10K dataset).
  • Tool-driven generation: Integrates AgenticRAG, Text2SQL and other tools (Text2SQL workflow accepted by ICDE 2026, math data workflow by KDD 2026).
  • Document structured extraction: Large-scale PDF → QA conversion, book PDF → visual QA conversion.

DataFlow Suite Components#

ComponentFunction
DataFlow-WebUIVisual drag-and-drop pipeline building & management (Vue.js frontend + FastAPI backend)
DataFlow-AgentAI-driven assistant for auto-composing and optimizing operators via natural language
DataFlow-EcosystemModular distribution layer with standardized operator registration, supporting domain extensions (e.g., DataFlow-MM, DataFlow-AI4S)
RayOrchRay-based high-performance distributed computing orchestration layer

Typical Use Cases#

  • LLM pre-training data preparation: Extract and filter high-quality pre-training corpus from raw text.
  • SFT data synthesis: Automatically generate high-quality instruction-response pairs.
  • RL training data preparation: Provide high-quality data for reinforcement learning.
  • RAG system data construction: Extract structured knowledge from PDFs/documents.
  • Domain-specific data preparation: Healthcare, finance, law, academic research.
  • Math/code data augmentation: Specialized pipelines for math reasoning and code generation.
  • Text2SQL data augmentation: SQL-aware data augmentation framework (+3% execution accuracy).
  • Enterprise data governance: Traceable, manageable data governance workflows based on Git ecosystem.

Installation#

pip (recommended):

pip install uv
uv pip install open-dataflow

For local GPU inference (vLLM): uv pip install open-dataflow[vllm] Verify: dataflow -v

Docker:

docker pull molyheci/dataflow:cu124
docker run --gpus all -it molyheci/dataflow:cu124

WebUI: dataflow webui (opens http://localhost:8000/)

Key Configuration#

  • LLM backend: Configure any OpenAI-compatible API via api_url; supports local vLLM inference.
  • Key management: API keys injected via DF_API_KEY environment variable.
  • Data formats: Native support for JSON, JSONL, CSV input/output.
  • Platforms: Windows, Linux, macOS (Python 3.10/3.11/3.12).

Unconfirmed Information#

  • Domain extension modules (DataFlow-MM, DataFlow-AI4S) lack specific repository links.
  • DataFlow-Agent and RayOrch independent repos/docs links are unconfirmed.
  • DataFlow-Instruct-10K dataset lacks download or HuggingFace hosting link.
  • Release date noted as 2025-06-28 may conflict with referenced conference timelines; pending verification.

Related Projects

View All arrow_forward

STAY UPDATED

Get the latest AI tools and trends delivered straight to your inbox. No spam, just intelligence.

rocket_launch