Best Open Source Big Data Libraries

A curated list of the most popular GitHub repositories tagged with big data. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.

#1 binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

69,462 stars

#2 ClickHouse/ClickHouse

ClickHouse® is a real-time analytics database management system

46,366 stars · C++

#3 apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

42,993 stars · Scala

#4 donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

28,928 stars · Python

#5 apache/flink

Apache Flink

25,871 stars · Java

#6 amark/gun

An open source cybersecurity protocol for syncing decentralized graph data.

18,968 stars · JavaScript

#7 heibaiying/BigData-Notes

A Beginner's Guide to Big Data ⭐

16,883 stars · Java

#8 prestodb/presto

The official home of the Presto distributed SQL query engine for big data

16,668 stars · Java

#9 andkret/Cookbook

The Data Engineering Cookbook

14,995 stars · Python

#10 StarRocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

11,639 stars · Java

#11 catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

8,845 stars · C++

#12 apache/datafusion

Apache DataFusion SQL Query Engine

8,710 stars · Rust

#13 delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

8,635 stars · Scala

#14 feast-dev/feast

The Open Source Feature Store for AI/ML

7,000 stars · Python

#15 vespa-engine/vespa

AI + Data, online. https://vespa.ai

6,904 stars · Java

#16 ytsaurus/ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

2,158 stars · C++

#17 apache/ozone

Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

1,203 stars · Java

#18 apache/flink-kubernetes-operator

Apache Flink Kubernetes Operator

1,007 stars · Java

#19 Lonero-Team/Decentralized-Internet

An SDK/library for decentralized web and distributed computing projects

520 stars · JavaScript

#20 drshahizan/HPDP

High performance data processing employs high performance computing (HPC) to process data, which is then translated into information and knowledge. The advent of high-performance computing and data analytics enabled real-time interrogation of extremely large data sets.

153 stars · Jupyter Notebook