apache / gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

1,533 stars
585 forks
873 issues
Languages: Scala, C++, Java

Repository Overview (README excerpt)

Apache Gluten

**A Middle Layer for Offloading JVM-based SQL Engines' Execution to Native Engines**

Introduction

Background

Apache Spark is a mature and stable project that has been under continuous development for many years. It is one of the most widely used frameworks for scaling out the processing of petabyte-scale datasets. Over time, the Spark community has had to address significant performance challenges, which required a variety of optimizations. A major milestone came with Spark 2.0, where Whole-Stage Code Generation replaced the Volcano Model, delivering up to a 2× speedup. Since then, most subsequent improvements have focused on the query plan level, while the performance of individual operators has almost stopped improving.

In recent years, several native SQL engines have been developed, such as ClickHouse and Velox. With features like native execution, columnar data formats, and vectorized data processing, these engines can outperform Spark’s JVM-based SQL engine. However, they currently don't directly support Spark SQL execution.

Design Overview

“Gluten” is Latin for "glue". The main goal of the Gluten project is to glue native engines to Spark SQL. Thus, we can benefit from the high performance of native engines and the high scalability enabled by the Spark ecosystem.

The basic design principle is to reuse Spark’s control flow, while offloading compute-intensive data processing to the native side. More specifically:

• Transform Spark’s physical plan to Substrait plan, then transform it to native engine's plan.
• Offload performance-critical data processing to native engine.
• Define clear JNI interfaces for native SQL engines.
• Allow easy switching between available native backends.
• Reuse Spark’s distributed control flow.
• Manage data sharing between JVM and native.
• Provide extensibility to support more native engines.

Target Users

Gluten's target users include anyone who wants to fundamentally accelerate Spark SQL.
As a plugin to Spark, Gluten requires no changes to the DataFrame API or SQL queries; users only need to configure it correctly.

Architecture

The overview chart is shown below.

Substrait provides a well-defined, cross-language specification for data compute operations. Spark’s physical plan is transformed into a Substrait plan, which is then passed to the native side through a JNI call. On the native side, a chain of native operators is constructed and offloaded to the native engine. Gluten returns the results as a ColumnarBatch, and Spark’s Columnar API (introduced in Spark 3.0) is used during execution. Gluten adopts the Apache Arrow data format as its underlying representation.

Currently, Gluten supports only ClickHouse and Velox backends. Velox is a C++ database acceleration library which provides reusable, extensible and high-performance data processing components. In addition, Gluten is designed to be extensible, allowing support for additional backends in the future.

Gluten's key components:

• **Query Plan Conversion**: Converts Spark's physical plan to Substrait plan.
• **Unified Memory Management**: Manages native memory allocation.
• **Columnar Shuffle**: Handles shuffling of Gluten's columnar data. The shuffle service of Spark core is reused, while a columnar exchange operator is implemented to support Gluten's columnar data format.
• **Fallback Mechanism**: Provides fallback to vanilla Spark for unsupported operators. Gluten's ColumnarToRow (C2R) and RowToColumnar (R2C) convert data between Gluten's columnar format and Spark's internal row format to support fallback transitions.
• **Metrics**: Collected from Gluten native engine to help monitor execution, identify bugs, and diagnose performance bottlenecks. The metrics are displayed in Spark UI.
• **Shim Layer**: Ensures compatibility with multiple Spark versions. Gluten supports the latest 3–4 Spark releases during its development cycle, and currently supports Spark 3.2, 3.3, 3.4, and 3.5.
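The fallback mechanism described above can be illustrated with a small toy model. The sketch below is not Gluten's code or API: the function `plan_with_fallback`, the set `NATIVE_SUPPORTED`, and the operator names are hypothetical, and the plan is simplified to a linear pipeline. It only shows the idea that operators the native backend cannot run stay in vanilla Spark, with a ColumnarToRow or RowToColumnar transition inserted wherever the data format changes.

```python
from typing import Iterable, List

# Hypothetical set of operators a native backend can execute.
# This is an illustration only and does not reflect real backend coverage.
NATIVE_SUPPORTED = frozenset({"Scan", "Filter", "Project", "HashAggregate"})


def plan_with_fallback(
    operators: Iterable[str],
    supported: frozenset = NATIVE_SUPPORTED,
) -> List[str]:
    """Annotate a linear pipeline (listed from source to sink).

    Supported operators are tagged [native] and assumed to produce
    columnar data; the rest are tagged [spark] and assumed to produce
    rows. A transition is inserted wherever the format changes.
    """
    planned: List[str] = []
    prev_native = None  # format of the previous operator; None at the source
    for op in operators:
        is_native = op in supported
        if prev_native is not None and prev_native != is_native:
            # Columnar -> row when leaving the native side, row -> columnar
            # when re-entering it.
            planned.append("ColumnarToRow" if prev_native else "RowToColumnar")
        planned.append(f"{op}[native]" if is_native else f"{op}[spark]")
        prev_native = is_native
    return planned


if __name__ == "__main__":
    # "UnsupportedOp" stands in for any operator the backend cannot offload.
    for step in plan_with_fallback(["Scan", "Filter", "UnsupportedOp", "Project"]):
        print(step)
```

In this toy model each fallback costs two format conversions, which suggests why a plan with many isolated unsupported operators may benefit less from offloading than one whose supported operators are contiguous.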
User Guide

Below is a basic configuration to enable Gluten in Spark. There are two ways to acquire the Gluten JAR for the above configuration.

Use Released JAR

Please download the tar package here, then extract the Gluten JAR from it. Additionally, Gluten provides nightly builds based on the main branch for early testing. The nightly build JARs are available at Apache Gluten Nightlies. They have been verified on CentOS 7/8/9 and Ubuntu 20.04/22.04.

Build From Source

For the **Velox** backend, please refer to Velox.md and build-guide.md. For the **ClickHouse** backend, please refer to ClickHouse.md. The Gluten JAR will be generated under after the build.

Configurations

Common configurations used by Gluten are listed in Configuration.md. Velox-specific configurations are listed in velox-configuration.md. The Gluten Velox backend honors some Spark configurations, ignores others, and many are transparent to it. See velox-spark-configuration.md for details, and velox-parquet-write-configuration.md for Parquet write configurations.

Resources

• Gluten website
• Velox repository
• ClickHouse repository
• Gluten Intro Video at Data AI Summit 2022
• Gluten Intro Article on Medium
• Gluten Intro Article on Kyligence.io (Chinese)
• Velox Intro from Meta

Contribution

Contributions to the Gluten project are welcome! See CONTRIBUTING.md for guidelines on how to make contributions.

Community

Gluten successfully became an Apache Incubator project in March 2024. Here are several ways to connect with the community.

GitHub

You are welcome to report issues or start discussions on GitHub. Please search the GitHub issue list before creating a new one to avoid duplication.

Mailing List

For any technical discussions, please email dev@gluten.apache.org. You can browse the archives to view past discussions, or subscribe to the mailing list to receive updates.

Slack Channel (English)

Request an invitation to the ASF Slack workspace via this page. Once invited, you can join the **gluten** channel.
The ASF Slack login entry…