datahub-project / datahub

The Metadata Platform for your Data and AI Stack

11,669 stars
3,395 forks
791 issues
Java, Python, TypeScript

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing datahub-project/datahub in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are loaded only when you start an analysis.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/datahub-project/datahub)

Repository Overview (README excerpt)


The #1 Open Source AI Data Catalog

_Enterprise-grade metadata platform enabling discovery, governance, and observability across your entire data ecosystem_

Quick Start • Live Demo • Documentation • Roadmap • Slack Community • YouTube

Built with ❤️ by DataHub and LinkedIn

---

Search, discover, and understand your data with DataHub's unified metadata platform

---

🤖 **NEW: Connect AI Agents to DataHub via Model Context Protocol (MCP)**

▶️ Click to watch full demo on YouTube

Connect your AI coding assistants (Cursor, Claude Desktop, Cline) directly to DataHub. Query metadata with natural language: _"What datasets contain PII?"_ or _"Show me lineage for this table"_

**Quick setup:** Learn more →

---

What is DataHub?

> **🔍 Finding the right DataHub?** This is the **open-source metadata platform** at datahub.com (GitHub: datahub-project/datahub). It was previously hosted at , which now redirects to datahub.com. This project is **not related to** datahub.io, which is a separate public dataset hosting service. See the FAQ below.

**DataHub is the #1 open-source AI data catalog** that enables discovery, governance, and observability across your entire data ecosystem. Originally built at LinkedIn, DataHub now powers data discovery at thousands of organizations worldwide, managing millions of data assets.

**The Challenge:** Modern data stacks are fragmented across dozens of tools: warehouses, lakes, BI platforms, ML systems, AI agents, orchestration engines. Finding the right data, understanding its lineage, and ensuring governance is like searching through a maze blindfolded.

**The DataHub Solution:** DataHub acts as the central nervous system for your data stack, connecting all your tools through real-time streaming or batch ingestion to create a unified metadata graph. Unlike static catalogs, DataHub keeps your metadata fresh and actionable, powering both human teams and AI agents.

Why DataHub?

• **🚀 Battle-Tested at Scale:** Born at LinkedIn to handle hyperscale data, now proven at thousands of organizations worldwide managing millions of data assets
• **⚡ Real-Time Streaming:** Metadata updates in seconds, not hours or days
• **🤖 AI-Ready:** Native support for AI agents via MCP, LLM integrations, and context management
• **🔌 Pioneering Ingestion Architecture:** Flexible push/pull framework (widely adopted by other catalogs) with 80+ production-grade connectors extracting deep metadata: column lineage, usage stats, profiling, and quality metrics
• **👨‍💻 Developer-First:** Rich APIs (GraphQL, OpenAPI), Python + Java SDKs, CLI tools
• **🏢 Enterprise Ready:** Battle-tested security, authentication, authorization, and audit trails
• **🌍 Open Source:** Apache 2.0 licensed, vendor-neutral, community-driven

---

🧠 The Context Foundation

Essential for modern data teams and reliable AI agents:

• **Context Management Is the Missing Piece in the Agentic AI Puzzle** - Why context management is essential for deploying reliable AI agents at scale
• **Data Lineage: What It Is and Why It Matters** - Understanding the map of how data flows through your organization
• **What is Metadata Management?** - A comprehensive guide for enterprise data leaders

---

📑 Table of Contents

• FAQ
• See DataHub in Action
• Quick Start
• Installation Options
• Architecture
• Use Cases & Examples
• Trusted By
• Ecosystem
• Community
• Contributing
• Resources
• License

---

❓ Frequently Asked Questions

Is this the same project as datahub.io?

No. datahub.io is a completely separate project, a public dataset hosting service with no affiliation to this project. DataHub (this project) is an open-source metadata platform for data discovery, governance, and observability, hosted at datahub.com and developed at github.com/datahub-project/datahub.

What happened to datahubproject.io?

DataHub was previously hosted at . That domain now redirects to datahub.com.
All documentation has moved to docs.datahub.com. If you find references to in blog posts or tutorials, they refer to this same project, just under its former domain.

Is DataHub related to LinkedIn's internal DataHub?

Yes. DataHub was originally built at LinkedIn to manage metadata at scale across their data ecosystem. LinkedIn open-sourced DataHub in 2020. It has since grown into an independent community project under the datahub-project GitHub organization, now hosted at datahub.com.

How do I install the DataHub metadata platform?

See the Quick Start section below for full instructions. The PyPI package is .

---

🎨 See DataHub in Action

• **🔍 Universal Search:** Find any data asset instantly across your entire stack
• **📊 Column-Level Lineage:** Trace data flow from source to consumption
• **📋 Rich Dataset Profiles:** Schema, statistics, documentation, and ownership
• **🏛️ Governance Dashboard:** Manage policies, tags, and compliance

**▶️ Watch DataHub in Action:**

• 5-Minute Product Tour (YouTube)
• Try Live Demo (No installation required)

---

🚀 Quick Start

Option 1: Try the Hosted Demo (Fastest)

No installation required. Explore a fully-loaded DataHub instance with sample data instantly:

**🌐 Launch Live Demo: demo.datahub.com**

Option 2: Run Locally with Python (Recommended)

Get DataHub running on your machine in under 2 minutes:

**Note:** You can also use or other Python package managers instead of pip.

**What's included:**

• ✅ **Full Stack:** GMS backend, React UI, Elasticsearch, MySQL, and Kafka.
• ✅ **Sample Data:** Pre-loaded datasets, lineage, and owners for exploration.
• ✅ **Ingestion Ready:** Fully prepared to connect your own local or cloud data sources.

Option 3: Run from Source (For Contributors)

Best for advanced users who want to modify the core codebase or run directly from the repository:

Next Steps

• **🔌 Connect Your Data:** Explore our Ingestion Guides for Snowflake, BigQuery, dbt, and more.
• **📚 Learn the Basics:** Walk through the Getting Started Guide
• **🎓 DataHub Academy:** Deep dive…
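
The lineage idea mentioned in this excerpt ("Trace data flow from source to consumption") comes down to walking a graph of upstream dependencies. The following is a minimal Python sketch of that idea over an in-memory dictionary. It does not use the DataHub SDK or API, and the dataset names, the `UPSTREAMS` mapping, and the `all_upstreams` helper are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical in-memory lineage graph: each key is a dataset, each value
# lists the datasets it is directly derived from (its direct upstreams).
# These names are illustrative and do not come from DataHub.
UPSTREAMS = {
    "dashboard.revenue": ["warehouse.orders_daily"],
    "warehouse.orders_daily": ["raw.orders", "raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}


def all_upstreams(dataset, graph):
    """Return every transitive upstream of `dataset` in breadth-first order."""
    seen = []
    queue = deque(graph.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another path
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


print(all_upstreams("dashboard.revenue", UPSTREAMS))
# prints ['warehouse.orders_daily', 'raw.orders', 'raw.customers']
```

A real metadata graph would hold far more than dataset names (schemas, owners, column-level edges), but the traversal pattern stays the same: start at the consumer and follow upstream edges until the sources are reached.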