Best Open Source Big Data Libraries
A curated list of the most popular GitHub repositories tagged with big data.
#1 binhnguyennus/awesome-scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
#2 ClickHouse/ClickHouse
ClickHouse® is a real-time analytics database management system
#3 apache/spark
Apache Spark - A unified analytics engine for large-scale data processing
#4 donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
#5 apache/flink
Apache Flink
#6 amark/gun
An open source cybersecurity protocol for syncing decentralized graph data.
#7 heibaiying/BigData-Notes
A beginner's guide to big data
#8 prestodb/presto
The official home of the Presto distributed SQL query engine for big data
#9 andkret/Cookbook
The Data Engineering Cookbook
#10 trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
#11 StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
#12 apache/datafusion
Apache DataFusion SQL Query Engine
#13 catboost/catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
#14 delta-io/delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
#15 crate/crate
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
#16 apache/paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
#17 ytsaurus/ytsaurus
YTsaurus is a scalable and fault-tolerant open-source big data platform.
#18 uxlfoundation/scikit-learn-intelex
Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
#19 apache/ozone
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
#20 caioricciuti/ch-ui
Use CH-UI to work with data in your self-hosted ClickHouse instance through a user-friendly interface. CH-UI is a modern, feature-rich user interface for ClickHouse databases. It offers an intuitive platform for running queries and visualizing metrics about your instance.