Best Open Source Big Data Libraries
A curated list of the most popular GitHub repositories tagged with big data. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.
#1 binhnguyennus/awesome-scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
#2 ClickHouse/ClickHouse
ClickHouse® is a real-time analytics database management system
#3 apache/spark
Apache Spark - A unified analytics engine for large-scale data processing
#4 donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
#5 apache/flink
Apache Flink
#6 amark/gun
An open source cybersecurity protocol for syncing decentralized graph data.
#7 heibaiying/BigData-Notes
A Beginner's Guide to Big Data
#8 prestodb/presto
The official home of the Presto distributed SQL query engine for big data
#9 andkret/Cookbook
The Data Engineering Cookbook
#10 StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
#11 catboost/catboost
A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks for Python, R, Java, and C++. Supports computation on CPU and GPU.
#12 apache/datafusion
Apache DataFusion SQL Query Engine
#13 delta-io/delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
#14 feast-dev/feast
The Open Source Feature Store for AI/ML
#15 vespa-engine/vespa
AI + Data, online. https://vespa.ai
#16 ytsaurus/ytsaurus
YTsaurus is a scalable and fault-tolerant open-source big data platform.
#17 apache/ozone
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
#18 apache/flink-kubernetes-operator
Apache Flink Kubernetes Operator
#19 Lonero-Team/Decentralized-Internet
An SDK/library for decentralized web and distributed computing projects
#20 drshahizan/HPDP
High-performance data processing employs high-performance computing (HPC) to process data, which is then translated into information and knowledge. The advent of high-performance computing and data analytics has enabled real-time interrogation of extremely large data sets.