Best Open Source Big Data Libraries

A curated list of the most popular GitHub repositories tagged with big data. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.

#1 binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

69,462 stars

#2 ClickHouse/ClickHouse

ClickHouse® is a real-time analytics database management system

46,366 stars · C++

#3 apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

42,993 stars · Scala

#4 donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

28,928 stars · Python

#5 apache/flink

Apache Flink

25,871 stars · Java

#6 amark/gun

An open source cybersecurity protocol for syncing decentralized graph data.

18,968 stars · JavaScript

#7 heibaiying/BigData-Notes

A Getting-Started Guide to Big Data :star:

16,883 stars · Java

#8 prestodb/presto

The official home of the Presto distributed SQL query engine for big data

16,668 stars · Java

#9 andkret/Cookbook

The Data Engineering Cookbook

14,995 stars · Python

#10 trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

12,928 stars · Java

#11 StarRocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

11,788 stars · Java

#12 apache/datafusion

Apache DataFusion SQL Query Engine

8,877 stars · Rust

#13 catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

8,845 stars · C++

#14 delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

8,635 stars · Scala

#15 crate/crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

4,409 stars · Java

#16 apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

3,299 stars · Java

#17 ytsaurus/ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

2,195 stars · C++

#18 uxlfoundation/scikit-learn-intelex

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

1,350 stars · Python

#19 apache/ozone

Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

1,218 stars · Java

#20 caioricciuti/ch-ui

Use CH-UI to work with data from your self-hosted ClickHouse instance through a user-friendly interface. CH-UI is a modern, feature-rich user interface for ClickHouse databases. It offers an intuitive platform for writing and executing queries and for visualizing metrics about your instance.

673 stars · Go
