# stepfun-ai / Step-Audio-R1
## Repository Overview (README excerpt)
## Step-Audio-R1

🔥🔥🔥 News!!

- **Jan 14, 2026:** 🚀 We release the inference code and model weights of **Step-Audio-R1.1** (HuggingFace; ModelScope).
- **Nov 27, 2025:** 🎉 We release the inference code and model weights of **Step-Audio-R1** (HuggingFace; ModelScope).
- **Nov 27, 2025:** 🎮 We released the HF Space Playground.
- **Nov 19, 2025:** 🎉 We release the Demo Page.
- **Nov 19, 2025:** 👋 We release the technical report of Step-Audio-R1.

WeChat Developer Group

### 📑 Open-source Plan

- [x] Inference Code (vLLM)
- [x] Online demo (Gradio)
- [x] Model Checkpoints

### Overview of R1.1

#### Introduction

Step-Audio R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both **real-time responsiveness** and **strong reasoning capability**. Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables *thinking while speaking*, achieving high intelligence without sacrificing speed.

#### Mind-Paced Speaking (Low Latency)

Based on the research *Mind-Paced Speaking*, the Realtime variant adopts a **Dual-Brain Architecture**:

- A **Formulation Brain** responsible for high-level reasoning
- An **Articulation Brain** dedicated to speech generation

This decoupling allows the model to perform **chain-of-thought reasoning during speech output**, maintaining ultra-low latency while handling complex tasks in real time.

#### Acoustic-Grounded Reasoning (High Intelligence)

To address the *inverted scaling* issue, where reasoning over transcripts can degrade performance, Step-Audio R1.1 grounds its reasoning directly in acoustic representations rather than text alone. Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to **state-of-the-art performance**, including top-ranking results on the AA benchmark.
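The dual-brain decoupling described above can be illustrated with a toy interleaving sketch. This is purely conceptual and not the released implementation: the step names, chunking, and one-reasoning-step-per-chunk pacing are all illustrative assumptions.

```python
# Conceptual sketch of "thinking while speaking": the Formulation Brain
# produces reasoning steps while the Articulation Brain keeps emitting
# speech chunks, so speech output never stalls on reasoning.

def formulation_brain(question):
    # hypothetical coarse reasoning steps
    for step in ["parse intent", "retrieve facts", "plan answer"]:
        yield f"think:{step}"

def articulation_brain(speech_plan):
    # hypothetical per-chunk speech generation
    for chunk in speech_plan:
        yield f"speak:{chunk}"

def mind_paced_speaking(question):
    """Interleave reasoning and speaking: at most one reasoning step per
    emitted speech chunk, so latency stays bounded by the chunk size."""
    thinker = formulation_brain(question)
    speaker = articulation_brain(["Sure,", "here", "is", "the", "answer."])
    trace = []
    for chunk in speaker:
        step = next(thinker, None)  # think a little, if anything is left
        if step:
            trace.append(step)
        trace.append(chunk)         # speech is never blocked on thinking
    return trace
```

The point of the sketch is only the scheduling: reasoning steps are amortized across speech chunks instead of being completed up front.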
### Overview of R1

#### Introduction

Step-Audio-R1 is the **first audio language model to successfully unlock test-time compute scaling**. It decisively solves the "inverted scaling" anomaly plaguing existing models, where performance paradoxically degrades with longer reasoning chains. We identify the root cause of this failure as **Textual Surrogate Reasoning**: conventional models, due to text-based initialization, rely on linguistic abstractions (analyzing transcripts) rather than genuine acoustic properties. To resolve this modality mismatch, we introduce **Modality-Grounded Reasoning Distillation (MGRD)**, an iterative training framework that shifts the model's reasoning focus from textual surrogates to acoustic analysis.

This approach allows us to create **Step-Audio-R1**, which:

- Is the **first audio reasoning model** that successfully benefits from test-time compute scaling.
- **Surpasses Gemini 2.5 Pro and is comparable to Gemini 3** across comprehensive audio benchmarks.
- Transforms extended deliberation from a liability into a **powerful asset** for audio intelligence.

#### Model Architecture

Step-Audio-R1 builds on the architecture of our previous Step-Audio 2 and consists of three main components:

- **Audio Encoder:** the pre-trained **Qwen2 audio encoder**. It operates at a 25 Hz frame rate and is frozen during training.
- **Audio Adaptor:** a simple adaptor (identical to Step-Audio 2) that connects the encoder to the LLM and downsamples the feature frame rate to 12.5 Hz.
- **LLM Decoder:** **Qwen2.5 32B** as the core reasoning component. It directly takes the latent audio features from the adaptor and generates a purely textual output (first the reasoning, then the final reply).

The key innovation is our training method, **Modality-Grounded Reasoning Distillation (MGRD)**.
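The encoder–adaptor–decoder pipeline above can be sketched as follows. Only the frame rates (25 Hz in, 12.5 Hz out) come from the description; the feature dimension, the pair-averaging adaptor, and the decoder stub are assumptions for illustration, not the actual model code.

```python
# Illustrative data-flow sketch of the three components.

def audio_encoder(seconds, dim=8):
    """Stand-in for the frozen Qwen2 audio encoder: 25 feature frames
    per second of audio (dim is an illustrative feature size)."""
    return [[0.0] * dim for _ in range(int(seconds * 25))]

def audio_adaptor(frames):
    """Downsample 25 Hz -> 12.5 Hz; here, by averaging adjacent frame
    pairs (the real adaptor's mechanism is not specified in the README)."""
    return [
        [(a + b) / 2 for a, b in zip(frames[i], frames[i + 1])]
        for i in range(0, len(frames) - 1, 2)
    ]

def llm_decoder(latents):
    """Stub for Qwen2.5 32B: consumes latent audio frames, emits text
    (reasoning first, then the final reply)."""
    return f"<reasoning...> final reply (conditioned on {len(latents)} latent frames)"

features = audio_encoder(4.0)      # 4 s of audio -> 100 frames at 25 Hz
latents = audio_adaptor(features)  # 50 frames at 12.5 Hz
reply = llm_decoder(latents)
```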
MGRD iteratively refines the model's thoughts, progressively strengthening their connection to the underlying audio features until they evolve into **"native audio think."** This ensures the model's reasoning is not merely about the transcribed text but is deeply grounded in the **acoustic nuances** of the audio itself.

### Model Usage

#### 📜 Requirements

- **GPU:** NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
- **Operating System:** Linux.
- **Python:** >= 3.10.0.

#### ⬇️ Download Model

First, download the Step-Audio-R1 model weights.

- **Method A · Git LFS**
- **Method B · Hugging Face CLI**

#### 🚀 Deployment and Execution

We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

##### 🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

- **Pull the image.**
- **Start the service**, assuming the model has been downloaded into a folder in the current directory.

After the service starts, it will listen on the configured port.

##### Method 2 · Run from Source (Compile vLLM)

Step-Audio-R1 requires a customized vLLM backend.

- **Download the source code.**
- **Prepare the environment.**
- **Install and compile:** vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the build.
- **Switch branch:** after compilation, switch to the branch that supports Step-Audio.
- **Start the service.**

After the service starts, it will listen on the configured port.

#### 🧪 Client Examples

Get the example code and run it.

### Citation

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

### Star History
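As a hypothetical stand-in for the repository's client example code, the sketch below builds an OpenAI-compatible chat request carrying base64-encoded audio, of the kind a vLLM server typically accepts. The message schema, model name, audio format, and endpoint path are assumptions; consult the repository's actual example code for the exact request format and port.

```python
# Build a chat-completions request with one audio part and one text part.
import base64

def build_request(audio_bytes, question, model="Step-Audio-R1"):
    """Package raw audio bytes plus a text question into one user turn
    (message layout assumed, not taken from the repository)."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

# POST this dict as JSON to the running service, e.g.
#   http://localhost:<port>/v1/chat/completions
# where <port> is whatever the service was configured to listen on.
```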