HKUSTDial / flash-sparse-attention
Trainable fast and memory-efficient sparse attention
Repository Overview (README excerpt)
**English** | 简体中文

Flash-Sparse-Attention is a high-performance trainable sparse attention implementation that combines Flash Attention's memory efficiency with sparse computation, enabling Transformer models to handle extremely long sequences.

## Key Features

> [!NOTE]
> Support for arbitrary mask and bias shapes is available in this branch. The current main branch no longer maintains that feature set.

### Supported Features

- Forward and backward passes for dense attention, sparse attention, and gated attention
- Regular batched inputs and varlen inputs
- Causal attention and local window attention
- Arbitrary combinations of Q and KV sequence lengths, with head dimensions up to 256
- Grouped Query Attention and Multi Query Attention
- Sparse softmax threshold control
- Gated attention with gate inputs and configurable gating sparsity
- Split-KV path optimization for decoding workloads

### Features We Aim to Support

- Paged Attention
- TMA, WGMMA, and FP8 low precision
- Sequence parallelism

## Installation

### Requirements

- **Linux**: Ubuntu 22.04 or later
- **NVIDIA GPU**: Compute Capability 8.0 or higher
- **Runtime**: NVIDIA driver and runtime compatible with your PyTorch and Triton installation
- **Python**: 3.9 or later
- **PyTorch**: 2.5.1 or later
- **Triton**: installed automatically as a default dependency

### Install

Install from PyPI:

To install from source:

## Quick Start

### Basic Usage

Below are examples for the three common attention variants.

#### Dense Attention

Use this when you do not need explicit sparsification but still want an efficient attention kernel.

#### Sparse Attention

Use this when you want to skip low-contribution attention weights and reduce effective compute on long sequences.

#### Gated Attention

Use this when you need explicit gating signals for sparse attention. One gate input controls query-side gating and another controls key-side gating.

## Performance

The following benchmarks were collected on SM120 and cover forward, backward, and decoding workloads.
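To make the sparse variant concrete before looking at the numbers: the sketch below shows one plausible reading of threshold-based sparse attention in plain NumPy. The function name, the `threshold` parameter, and the renormalization step are illustrative assumptions for intuition only, not this library's actual API or kernel semantics.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_threshold_attention(q, k, v, threshold=0.01):
    """Dense attention, except weights below `threshold` are zeroed and
    the surviving weights are renormalized. Hypothetical semantics chosen
    for illustration; not Flash-Sparse-Attention's actual definition."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (Lq, Lk) attention logits
    w = softmax(scores, axis=-1)
    w = np.where(w >= threshold, w, 0.0)     # drop low-contribution weights
    w = w / w.sum(axis=-1, keepdims=True)    # renormalize survivors
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = sparse_threshold_attention(q, k, v, threshold=0.05)
print(out.shape)  # (4, 8)
```

With `threshold=0.0` this reduces exactly to dense attention, which is why a fused sparse kernel can share most of its code path with the dense one.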
They include Dense, Sparse, and Gated implementations, with FlashAttention as a baseline.

### Forward Performance

### Backward Performance

### Decode Performance

## Benchmarking

Benchmark scripts are located under `tests`, covering forward, backward, and decoding performance. By default, these scripts use the attention projection layers from the Qwen model family to generate Q, K, and V states with distributions closer to real LLM workloads, and they build input sequences from the Needle-in-a-Haystack dataset.

### Forward Performance

### Backward Performance

### Decode Performance

## Citation

If you use FSA in your research, please cite:

## Acknowledgments

This project builds upon and integrates several excellent works:

- **OpenSeek** - kernel development support
- **Flash-Attention** - memory-efficient attention computation
- **NVIDIA CUTLASS** - high-performance matrix operations library

We thank the open-source community for its contributions to efficient Transformer implementations.
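As a supplement to the supported-features list above: Grouped Query Attention (GQA) and Multi Query Attention (MQA) follow a standard formulation in which several query heads share one key/value head. The NumPy sketch below shows that standard formulation for intuition only; it is not this repository's kernel code, and all names in it are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """Grouped Query Attention: H query heads share H_kv key/value heads,
    where H must be a multiple of H_kv.
    Shapes: q is (H, Lq, D); k and v are (H_kv, Lk, D)."""
    h, h_kv = q.shape[0], k.shape[0]
    group = h // h_kv
    # Broadcast each KV head to the `group` query heads that share it.
    k = np.repeat(k, group, axis=0)                          # (H, Lk, D)
    v = np.repeat(v, group, axis=0)                          # (H, Lk, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]) # (H, Lq, Lk)
    return softmax(scores, axis=-1) @ v                      # (H, Lq, D)

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 6, 16))  # 2 KV heads -> groups of 4
v = rng.standard_normal((2, 6, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

MQA is the special case with a single KV head (`H_kv = 1`); sharing KV heads shrinks the KV cache, which is what makes the Split-KV decoding path listed above attractive for inference.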