OpenDCAI / DataFlow
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
# DataFlow

**Generate, Clean, and Prepare LLM Data, All-in-One**

Visual, low-code pipelines with flexible orchestration across domains and use cases.

💪 Turn raw data into high-quality LLM training datasets.

🎉 Get smarter LLMs cheaply, and give us a star ⭐ on GitHub for the latest updates.

**Beginner-friendly learning resources (continuously updated)**: [🎬 Video Tutorials](https://space.bilibili.com/3546929239689711?spm_id_from=333.337.0.0) | [📚 Written Tutorials](https://wcny4qa9krto.feishu.cn/wiki/I9tbw2qnBi0lEakmmAGclTysnFd)

## 📰 0. News

- **[2026-02-02] 🖥️ DataFlow WebUI is now available!** Launch the visual pipeline builder with a single command, and build and run DataFlow pipelines through an intuitive web interface. 👉 WebUI Docs
- **[2026-01-20] 🌟 Awesome Works Using DataFlow is now live!** A new section showcasing open-source projects and research built on DataFlow. Contributions are welcome! 👉 Awesome Works
- **[2025-12-19] 🎉 Our DataFlow technical report is now available!** Read and cite our work on arXiv: https://arxiv.org/abs/2512.16676
- **[2025-11-20] 🤖 Introducing new Data Agents for DataFlow!** Try them out and follow the tutorial on Bilibili: https://space.bilibili.com/3546929239689711/lists/6761342?type=season
- **[2025-06-28] 🎉 DataFlow is officially released!** Our data-centric AI system is now public. Stay tuned for future updates.

## 🔍 1. What is DataFlow?

DataFlow is a data preparation and training system designed to **generate, refine, evaluate, and filter** high-quality AI data from noisy sources (PDFs, plain text, low-quality QA pairs), thereby improving the performance of large language models (LLMs) through targeted training (pre-training, supervised fine-tuning, RL training) or RAG systems in domains such as healthcare, finance, legal, and academic research.
Through an operator-based design, DataFlow turns the entire data-cleaning workflow into a reproducible, reusable, and shareable pipeline, providing core infrastructure for the Data-Centric AI community. Additionally, we develop an intelligent agent capable of dynamically assembling new pipelines by recombining existing operators or creating new ones on demand.

## 🔍 2. Key Features

### ✅ 2.1 Ready-to-Use Data Synthesis and Cleaning Pipelines

- High-Quality Training Data Generation
  - Text, math, and code data generation (see DataFlow-Instruct-10K for results)
  - Data generation via tools such as AgenticRAG and Text2SQL
- Structured Data Extraction
  - Large-scale PDF → QA conversion
  - Large-scale book PDF → Visual-QA conversion
- Scientific Data Workflow Management
  - Text2SQL workflow management (accepted by ICDE 2026)
  - Math data workflows (accepted by KDD 2026)

### ⚙️ 2.2 Flexible Custom Pipeline Orchestration

- 10+ core operators define interaction patterns and design principles
- 100+ pipeline-specific operators available for reuse or reference
- Full support for creating custom operators: plug-and-play, easily packaged and distributed via GitHub or PyPI

### 🧠 2.3 Reproducible, Reusable, and Shareable Data-Centric AI System

- Data-governance algorithms are encapsulated as operator pipelines, enabling reproducibility and fair comparison of different data-governance strategies (❤️ research-friendly)
- Easily swap the underlying large models to quickly analyze the relationship between model performance and data quality
- Built on the Python and Git ecosystems for easy distribution, management, and traceability of high-quality, **user-defined** data-governance operators and pipelines (❤️ enterprise-friendly)

## 🛠️ 3. DataFlow Suite

The DataFlow Suite provides the essential infrastructure to automate and scale LLM data preparation with the DataFlow main repository. It comprises four tightly integrated layers:

- **DataFlow-WebUI** – An intuitive, visual interface for constructing and managing complex data pipelines through a drag-and-drop operator workflow.
- **DataFlow-Agent** – An AI-powered assistant that dynamically composes, executes, and optimizes operators and pipelines based on high-level user intent.
- **DataFlow-Ecosystem** – A modular distribution layer that standardizes operator registration. It enables domain-specific modules (e.g., DataFlow-MM, DataFlow-AI4S) to contribute extensible libraries under a unified abstraction.
- **RayOrch** – A high-performance orchestration layer built on Ray, providing distributed compute scheduling and resource management for massive-scale data tasks.

Together, these components form a unified, extensible environment that transforms raw data into model-ready intelligence.

## ✅ 4. Why use DataFlow?

Data generation and cleaning are crucial for high-quality models, but for both enterprises and individuals these tasks are often time-consuming, labor-intensive, and costly. **DataFlow provides a one-stop solution to tackle these challenges efficiently.** Compared with systems such as Nemo-Curator and Data-Juicer, DataFlow offers:

- **Enhanced Support for Data Synthesis Modules** – Seamlessly integrates text, code, and math data-generation pipelines for high-quality training datasets.
- **PyTorch-like Programming Management** – A clear **Pipeline → Operator → Prompt** hierarchical structure for workflow control.
- **Principled and Multi-Category Operator Classification** – Operators are systematically organized into functional categories such as **generation, evaluation, filtering, and refinement**, forming a scientifically grounded, multi-dimensional taxonomy that reflects the different stages of data preparation and enables precise operator selection and composition.
- **User-Friendly Design for Easy Debugging and Onboarding** – Simplified workflow patterns that reduce the learning curve and accelerate experimentation.

## 🔧 5. How do operators work?

DataFlow operators are designed with **simplicity and clarity** in mind.
Operators take structured inputs (JSON, JSONL, CSV) and produce high-quality outputs after intelligent processing. Each operator encapsulates a specific data processing task, providing a c…
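To make the operator contract and the **Pipeline → Operator** layering concrete, here is a minimal, self-contained Python sketch. All names here (`Operator`, `Pipeline`, `LengthFilter`, `Deduplicate`, the `run` method) are hypothetical illustrations of the pattern described above, not DataFlow's actual API:

```python
import json

class Operator:
    """One data-processing step: rows in, rows out."""
    def run(self, rows):
        raise NotImplementedError

class LengthFilter(Operator):
    """A 'filtering' operator: drop rows whose text is too short."""
    def __init__(self, min_chars=10):
        self.min_chars = min_chars
    def run(self, rows):
        return [r for r in rows if len(r.get("text", "")) >= self.min_chars]

class Deduplicate(Operator):
    """A 'refinement' operator: drop exact-duplicate texts."""
    def run(self, rows):
        seen, out = set(), []
        for r in rows:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class Pipeline:
    """Chain operators; each consumes the previous one's output."""
    def __init__(self, operators):
        self.operators = operators
    def run(self, rows):
        for op in self.operators:
            rows = op.run(rows)
        return rows

# JSONL-style input: one JSON object per line.
raw = "\n".join([
    json.dumps({"text": "short"}),
    json.dumps({"text": "a sufficiently long training example"}),
    json.dumps({"text": "a sufficiently long training example"}),
])
rows = [json.loads(line) for line in raw.splitlines()]
clean = Pipeline([LengthFilter(min_chars=10), Deduplicate()]).run(rows)
print(len(clean))  # -> 1: one row survives filtering + deduplication
```

Because every operator exposes the same rows-in/rows-out interface, pipelines can be recomposed freely, which is what makes the workflow reproducible and shareable as a unit.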