Best Open Source Spark Libraries

A curated list of the most popular GitHub repositories tagged with spark. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.

#1 apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

42,851 · Scala

#2 DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free 9-week course on building production-ready data pipelines. The next cohort starts in January 2026.

38,593 · Jupyter Notebook

#3 donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

28,880 · Python

#4 getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

28,236 · Python

#5 yeasy/docker_practice

Learn and understand the latest Docker and container technologies, with best practices drawn from real-world DevOps cases!

25,835 · Go

#6 aalansehaiyang/technology-talk

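[Big Tech Interview Column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!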

14,695

#7 horovod/horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

14,673 · Python

#8 deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...

14,205 · Java

#9 tobymao/sqlglot

Python SQL Parser and Transpiler

8,947 · Python

#10 mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,651 · Python

#11 Angel-ML/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

6,788 · Java

#12 apache/zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

6,606 · Java
