
Best Open Source Big Data Libraries

A curated list of the most popular GitHub repositories tagged with big data.

#1 binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

68,736 stars

#2 ClickHouse/ClickHouse

ClickHouse® is a real-time analytics database management system

45,985 stars, C++

#3 apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

42,851 stars, Scala

#4 donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

28,880 stars, Python

#5 apache/flink

Apache Flink

25,812 stars, Java

#6 thingsboard/thingsboard

Open-source IoT Platform - Device management, data collection, processing and visualization.

21,176 stars, Java

#7 amark/gun

An open source cybersecurity protocol for syncing decentralized graph data.

18,911 stars, JavaScript

#8 andkret/Cookbook

The Data Engineering Cookbook

14,957 stars, Python

#9 trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

12,579 stars, Java

#10 apache/predictionio

PredictionIO, a machine learning server for developers and ML engineers.

12,533 stars, Scala

#11 catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

8,811 stars, C++

#12 apache/couchdb

Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability

6,817 stars, Erlang

#13 vespa-engine/vespa

AI + Data, online. https://vespa.ai

6,789 stars, Java

#14 feast-dev/feast

The Open Source Feature Store for AI/ML

6,727 stars, Python

#15 apache/zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

6,606 stars, Java

#16 hazelcast/hazelcast

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

6,596 stars, Java