Best Open Source Spark Libraries
A curated list of the most popular GitHub repositories tagged with spark. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.
#1 apache/spark
Apache Spark - A unified analytics engine for large-scale data processing
#2 DataTalksClub/data-engineering-zoomcamp
Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026.
#3 donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
#4 getredash/redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
#5 yeasy/docker_practice
The latest Docker container technology: learn best-practice techniques from real-world cases. Learn and understand Docker and container technologies, with real DevOps practice!
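#6 aalansehaiyang/technology-talk
[Big Tech Interview Column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a stronger version of yourself!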
#7 horovod/horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
#8 deeplearning4j/deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
#9 tobymao/sqlglot
Python SQL Parser and Transpiler
#10 mage-ai/mage-ai
🧙 Build, run, and manage data pipelines for integrating and transforming data.
#11 Angel-ML/angel
A flexible and powerful parameter server for large-scale machine learning
#12 apache/zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.