Best Open Source Data Engineering Libraries

A curated list of the most popular GitHub repositories tagged with data engineering. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.

#1 apache/superset

Apache Superset is a Data Visualization and Data Exploration Platform

70,618 TypeScript

#2 GokuMohandas/Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.

46,391 Jupyter Notebook

#3 apache/airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

44,349 Python

#4 DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026. Join the course here 👇🏼

38,593 Jupyter Notebook

#5 eugeneyan/applied-ml

📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

28,694

#6 PrefectHQ/prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.

21,650 Python

#7 airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

20,743 Python

#8 Avaiga/taipy

Turns Data and AI algorithms into production-ready web applications in no time.

19,083 Python

#9 dagster-io/dagster

An orchestration platform for the development, production, and observation of data assets.

14,983 Python

#10 andkret/Cookbook

The Data Engineering Cookbook

14,957 Python

#11 datastacktv/data-engineer-roadmap

Roadmap to becoming a data engineer in 2021

12,745

#12 xonsh/xonsh

🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

9,221 Python

#13 risingwavelabs/risingwave

Event streaming platform for agents, apps, and analytics. Continuously ingest, transform, and serve event data in real time, at scale.

8,805 Rust

#14 mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,651 Python

#15 feast-dev/feast

The Open Source Feature Store for AI/ML

6,727 Python