
datajuicer / data-juicer

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

6,109 stars
347 forks
56 issues
Python, C++, Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing datajuicer/data-juicer in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are loaded only when you start an analysis.
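As a rough illustration of loading full files on demand, here is a minimal Python sketch. It is not RepoMind's implementation: the class `OnDemandContext`, its methods, and the prompt layout are hypothetical names chosen for this example. It shows only the general idea that a file is read in full the first time it is requested, and that nothing is read before then.

```python
from pathlib import Path
from typing import Dict, Iterable, List


class OnDemandContext:
    """Hypothetical helper: holds whole files, read only when first requested."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self._loaded: Dict[str, str] = {}  # relative path -> full file text

    def load(self, rel_path: str) -> str:
        # Read the complete file on first access; reuse the cached text afterwards.
        if rel_path not in self._loaded:
            self._loaded[rel_path] = (self.root / rel_path).read_text(encoding="utf-8")
        return self._loaded[rel_path]

    def build_prompt(self, question: str, rel_paths: Iterable[str]) -> str:
        # Each requested file goes into the prompt in full, not as retrieved chunks.
        parts = [f"### {p}\n{self.load(p)}" for p in rel_paths]
        return "\n\n".join(parts + [f"Question: {question}"])

    @property
    def loaded_paths(self) -> List[str]:
        # Paths that have actually been read so far.
        return sorted(self._loaded)
```

Constructing `OnDemandContext(repo_root)` reads nothing from disk. Files are read only when `load` or `build_prompt` names them, which mirrors the "loaded when you start an analysis" behavior in spirit.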

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/datajuicer/data-juicer)

Repository Overview (README excerpt)


Data-Juicer: The Data Operating System for the Foundation Model Era

Multimodal | Cloud-Native | AI-Ready | Large-Scale

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as *composable infrastructure*, providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.

Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters, with no glue code required.

> **Alibaba Cloud PAI** has deeply integrated Data-Juicer into its data processing products. See **Quickly submit a DataJuicer job**.

---

🚀 Quick Start

**Zero-install exploration**:
• JupyterLab Playground with Tutorials
• Ask DJ Copilot

**Install & run**:

**Or compose in Python**:

---

✨ Why Data-Juicer?

• Modular & Extensible Architecture
  • **200+ operators** spanning text, image, audio, video, and multimodal data
  • **Recipe-first**: Reproducible YAML pipelines you can version, share, and fork like code
  • **Composable**: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
  • **Hot-reload**: Iterate on operators without pipeline restarts
• Full-Spectrum Data Intelligence
  • **Foundation Models**: Pre-training, fine-tuning, RL, and evaluation-grade curation
  • **Agent Systems**: Tool-trace cleaning, context structuring, de-identification, and quality gating
  • **RAG & Analytics**: Extraction, normalization, semantic chunking, deduplication, and data profiling
• Production-Ready Performance
  • **Scale**: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
  • **Efficiency**: Deduplicate 5TB in 2.8h using 1280 cores
  • **Optimization**: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
  • **Observability**: Built-in tracing for debugging, auditing, and iterative improvement

> *⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo.* It helps more people discover the project and keeps you notified of new releases and features.

---

📰 News

[2026-03-17] Release v1.5.1: LaTeX OPs; Compressed Format Support; Operator Robustness Fixes
• 📄 Two new LaTeX-focused mapper OPs shipped, extending Data-Juicer's document processing capabilities to handle archives and figure contexts.
• 🗜️ Compressed dataset format support: files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
• 📚 New documentation covers cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
• 🤖 Major refactor and upgrade of data-juicer-agents completed: the project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.

[2026-02-12] Release v1.5.0: Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs
• 🚀 *Enhanced Distributed Execution Framework*: Introduced a partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
• 🤖 *Expanded Embodied AI Video Processing*: Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
• 💪🏻 *System Performance & Developer Experience Optimizations*: Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
• 🐳 *Critical Bug Fixes & Stability Improvements*: Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.

[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing
• 🤖 *Q&A Copilot*: Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to the Data-Juicer ecosystem!
  • Check 🤖 Data-Juicer Agents | 📃 Deploy-ready code | 🎬 More demos for more details.
• 🎬 *Video Bytes I/O*: Direct bytes processing for video pipelines
• 🫆 *Ray Mode Tracer*: Track changed samples in distributed processing
• 🐳 *Enhancements & fixes*: Refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.

[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade
• *Embodied-AI OPs*: Added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus *S3 upload/download*.
• *New Pipeline OP*: Compose multiple OPs into one pipeline; introduced *Ray + vLLM* pipelines for LLM/VLM inference.
• *Docs upgrade*: Moved to a unified *Sphinx-based* documentation build/deploy workflow with isolated theme/architecture repo.
• *Enhancements & fixes*: Dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.

[2025-12-01] Release v1.4.4: NeurIPS'25 Spotlight, 6 New Video/MM OPs & S3 I/O
• NeurIPS'25 **Spotlight** for Data-Juicer 2.0
• *Repo split*: sandbox/recipes/agents moved to standalone repos
• *S3 I/O* added to loader/exporter
• *6 new video & multimodal OPs* (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View All Releases and News Archive

---

🔌 Users & Ecosystems

> The list below focuses on *developer-facing integrations and usage*, in *alphabetical order*.
> Missing your project / name? Feel free to open a PR or reach out.

Data-Juicer plugs into your existing stack and evolves with community contributions:

Extensions
• **data-juicer-agents**: DJ Copilot and agentic workflows
• **data-juicer-hu…