Best Open Source Data Engineering Libraries
A curated list of the most popular GitHub repositories tagged with data engineering. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.
#1 apache/superset
Apache Superset is a Data Visualization and Data Exploration Platform
#2 GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
#3 apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
#4 DataTalksClub/data-engineering-zoomcamp
Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026.
#5 eugeneyan/applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
#6 PrefectHQ/prefect
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
#7 airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
#8 Avaiga/taipy
Turns Data and AI algorithms into production-ready web applications in no time.
#9 dagster-io/dagster
An orchestration platform for the development, production, and observation of data assets.
#10 andkret/Cookbook
The Data Engineering Cookbook
#11 datastacktv/data-engineer-roadmap
Roadmap to becoming a data engineer in 2021
#12 xonsh/xonsh
🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.
#13 risingwavelabs/risingwave
Event streaming platform for agents, apps, and analytics. Continuously ingest, transform, and serve event data in real time, at scale.
#14 mage-ai/mage-ai
🧙 Build, run, and manage data pipelines for integrating and transforming data.
#15 feast-dev/feast
The Open Source Feature Store for AI/ML