Best Open Source Big Data Libraries
A curated list of the most popular GitHub repositories tagged with big data. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.
#1 binhnguyennus/awesome-scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
#2 ClickHouse/ClickHouse
ClickHouse® is a real-time analytics database management system
#3 apache/spark
Apache Spark - A unified analytics engine for large-scale data processing
#4 donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
#5 apache/flink
Apache Flink
#6 thingsboard/thingsboard
Open-source IoT Platform - Device management, data collection, processing and visualization.
#7 amark/gun
An open source cybersecurity protocol for syncing decentralized graph data.
#8 andkret/Cookbook
The Data Engineering Cookbook
#9 trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
#10 apache/predictionio
PredictionIO, a machine learning server for developers and ML engineers.
#11 catboost/catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
#12 apache/couchdb
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
#13 vespa-engine/vespa
AI + Data, online. https://vespa.ai
#14 feast-dev/feast
The Open Source Feature Store for AI/ML
#15 apache/zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
#16 hazelcast/hazelcast
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.