
stepfun-ai / Step-Audio-EditX

A powerful 3B-parameter, LLM-based reinforcement learning audio editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech.

881 stars
60 forks
37 issues
Python, Cuda, C

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing stepfun-ai/Step-Audio-EditX in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/stepfun-ai/Step-Audio-EditX)

Repository Overview (README excerpt)


Step-Audio-EditX

🔥🔥🔥 News!!!

• Jan 29, 2026:
  • 🧩 New Model Release:
    • Better performance, with an overall improvement of over 4%.
    • More **paralinguistic** tags have been added, including ** **, ** **, ** **, ** **, ** **, ** **.
    • Welcome to try out at StepFun Audio Studio
  • 💻 We release the **SFT**, **DPO** and **GRPO** training code.
  • 🌟 Training and inference for **vLLM** are now supported. Thanks to the vLLM team!
• Nov 28, 2025: 🚀 New Model Release: Now supporting ** ** and ** ** languages.
• Nov 23, 2025: 📊 Step-Audio-Edit-Benchmark Released!
• Nov 19, 2025: ⚙️ We release a **new version** of our model, which **supports polyphonic pronunciation control** and improves the performance of emotion, speaking style, and paralinguistic editing.
• Nov 12, 2025: 📦 We release the **optimized inference code** and **model weights** of **Step-Audio-EditX** (HuggingFace; ModelScope) and **Step-Audio-Tokenizer** (HuggingFace; ModelScope)
• Nov 07, 2025: ✨ Demo Page ; 🎮 HF Space Playground
• Nov 06, 2025: 👋 We release the technical report of Step-Audio-EditX.

Introduction

We are open-sourcing Step-Audio-EditX, a powerful **3B-parameter** LLM-based **Reinforcement Learning** audio model specialized in expressive and iterative audio editing. It excels at editing emotion, speaking style, and paralinguistics, and also features robust zero-shot text-to-speech (TTS) capabilities.

📑 Open-source Plan

• [x] Inference Code
• [x] Online demo (Gradio)
• [x] Step-Audio-Edit-Benchmark
• [x] Model Checkpoints
  • [x] Step-Audio-Tokenizer
  • [x] Step-Audio-EditX
  • [x] Step-Audio-EditX-Int4
• [ ] Training Code
  • [x] SFT training
  • [x] DPO training
  • [x] GRPO training
  • [ ] PPO training
• [ ] ⏳ Feature Support Plan
  • [ ] Editing
    • [x] Polyphone pronunciation control
    • [x] More paralinguistic tags ([Cough, Crying, Stress, etc.])
    • [ ] Filler word removal
  • [ ] Other Languages
    • [x] Japanese, Korean
    • [ ] Arabic, French, Russian, Spanish, etc.

Features

• **Zero-Shot TTS**
  • Excellent zero-shot TTS cloning for Mandarin, English, Sichuanese, and Cantonese.
  • To use dialect or other languages, just add a ** ** / ** ** / ** ** / ** ** tag before your text.
  • 🔥 Polyphone pronunciation control: all you need to do is replace the polyphonic characters with pinyin.
    • **[我也想过过过儿过过的生活]** -> **[我也想guo4guo4guo1儿guo4guo4的生活]**
• **Emotion and Speaking Style Editing**
  • Remarkably effective iterative control over emotions and styles, supporting **dozens** of options for editing.
  • Emotion Editing: [ *Angry*, *Happy*, *Sad*, *Excited*, *Fearful*, *Surprised*, *Disgusted*, etc. ]
  • Speaking Style Editing: [ *Act_coy*, *Older*, *Child*, *Whisper*, *Serious*, *Generous*, *Exaggerated*, etc. ]
  • Editing with more emotion and more speaking styles is on the way. **Get Ready!** 🚀
• **Paralinguistic Editing**
  • Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  • Supporting Tags:
    • [ *Breathing*, *Laughter*, *Surprise-oh*, *Confirmation-en*, *Uhm*, *Surprise-ah*, *Surprise-wa*, *Sigh*, *Question-ei*, *Dissatisfaction-hnn* ]
• **Available Tags**

| Category | Tag | Description |
| --- | --- | --- |
| emotion | happy | Expressing happiness |
| emotion | angry | Expressing anger |
| emotion | sad | Expressing sadness |
| emotion | fear | Expressing fear |
| emotion | surprised | Expressing surprise |
| emotion | confusion | Expressing confusion |
| emotion | empathy | Expressing empathy and understanding |
| emotion | embarrass | Expressing embarrassment |
| emotion | excited | Expressing excitement and enthusiasm |
| emotion | depressed | Expressing a depressed or discouraged mood |
| emotion | admiration | Expressing admiration or respect |
| emotion | coldness | Expressing coldness and indifference |
| emotion | disgusted | Expressing disgust or aversion |
| emotion | humour | Expressing humor or playfulness |
| speaking style | serious | Speaking in a serious or solemn manner |
| speaking style | arrogant | Speaking in an arrogant manner |
| speaking style | child | Speaking in a childlike manner |
| speaking style | older | Speaking in an elderly-sounding manner |
| speaking style | girl | Speaking in a light, youthful feminine manner |
| speaking style | pure | Speaking in a pure, innocent manner |
| speaking style | sister | Speaking in a mature, confident feminine manner |
| speaking style | sweet | Speaking in a sweet, lovely manner |
| speaking style | exaggerated | Speaking in an exaggerated, dramatic manner |
| speaking style | ethereal | Speaking in a soft, airy, dreamy manner |
| speaking style | whisper | Speaking in a whispering, very soft manner |
| speaking style | generous | Speaking in a hearty, outgoing, and straight-talking manner |
| speaking style | recite | Speaking in a clear, well-paced, poetry-reading manner |
| speaking style | act_coy | Speaking in a sweet, playful, and endearing manner |
| speaking style | warm | Speaking in a warm, friendly manner |
| speaking style | shy | Speaking in a shy, timid manner |
| speaking style | comfort | Speaking in a comforting, reassuring manner |
| speaking style | authority | Speaking in an authoritative, commanding manner |
| speaking style | chat | Speaking in a casual, conversational manner |
| speaking style | radio | Speaking in a radio-broadcast manner |
| speaking style | soulful | Speaking in a heartfelt, deeply emotional manner |
| speaking style | gentle | Speaking in a gentle, soft manner |
| speaking style | story | Speaking in a narrative, audiobook-style manner |
| speaking style | vivid | Speaking in a lively, expressive manner |
| speaking style | program | Speaking in a show-host/presenter manner |
| speaking style | news | Speaking in a news broadcasting manner |
| speaking style | advertising | Speaking in a polished, high-end commercial voiceover manner |
| speaking style | roar | Speaking in a loud, deep, roaring manner |
| speaking style | murmur | Speaking in a quiet, low manner |
| speaking style | shout | Speaking in a loud, sharp, shouting manner |
| speaking style | deeply | Speaking in a deep and low-pitched tone |
| speaking style | loudly | Speaking in a loud and high-pitched tone |
| paralinguistic | [sigh] | Sighing sound |
| paralinguistic | [inhale] | Inhaling sound |
| paralinguistic | [laugh] | Laughter sound |
| paralinguistic | [chuckle] | Chuckling sound |
| paralinguistic | [exhale] | Exhaling sound |
| paralinguistic | [clears throat] | Throat clearing sound |
| paralinguistic | [snort] | Snorting sound |
| paralinguistic | [giggle] | Giggling sound |
| paralinguistic | [cough] | Coughing sound |
| paralinguistic | [breath] | Breathing sound |
| paralinguistic | [uhm] | Hesitation sound: "Uhm" |
| paralinguistic | [Confirmation-en] | Confirming: "En" |
| paralinguistic | [Surprise-oh] | Expressing surprise: "Oh" |
| paralinguistic | [Surprise-ah] | Expressing surprise: "Ah" |
| paralinguistic | [Surprise-wa] | Expressing surprise: "Wa" |
| paralinguistic | [Surprise-yo] | Expressing surprise: "Yo" |
| paralinguistic | [Dissatisfaction-hnn] | Dissatisfied sound: "…
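
The polyphone rule from the Features list (replace each polyphonic character with its pinyin) and the bracketed paralinguistic tags from the Available Tags list can be prepared with a little plain-Python text handling before the text is passed to the model. The sketch below is illustrative only: `apply_pinyin` and `unknown_tags` are hypothetical helper names, not part of the Step-Audio-EditX codebase, and writing tags inline as `[tag]` is an assumption based on how the tag list is printed. The tag set is copied from the excerpt, which is truncated, so it is likely incomplete.

```python
import re

# Paralinguistic tags copied from the "Available Tags" list in this excerpt.
# The list is cut off in the source, so this set is probably incomplete.
KNOWN_PARALINGUISTIC_TAGS = {
    "sigh", "inhale", "laugh", "chuckle", "exhale", "clears throat",
    "snort", "giggle", "cough", "breath", "uhm", "Confirmation-en",
    "Surprise-oh", "Surprise-ah", "Surprise-wa", "Surprise-yo",
    "Dissatisfaction-hnn",
}


def apply_pinyin(text: str, readings: dict[int, str]) -> str:
    """Replace the character at each index in `readings` with its pinyin.

    Hypothetical helper: `readings` maps a character position in `text`
    to the tone-numbered pinyin that should be spoken there.
    """
    for index in readings:
        if not 0 <= index < len(text):
            raise IndexError(f"index {index} is outside the text")
    return "".join(readings.get(i, ch) for i, ch in enumerate(text))


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in the known set.

    Assumes tags are written inline as [tag]; matching is case-sensitive.
    """
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in found if tag not in KNOWN_PARALINGUISTIC_TAGS]


if __name__ == "__main__":
    sentence = "我也想过过过儿过过的生活"
    readings = {3: "guo4", 4: "guo4", 5: "guo1", 7: "guo4", 8: "guo4"}
    print(apply_pinyin(sentence, readings))
    # -> 我也想guo4guo4guo1儿guo4guo4的生活

    print(unknown_tags("Well [sigh] that was close [yawn]"))
    # -> ['yawn']
```

Keying the readings by character position rather than by character keeps the five occurrences of 过 independent, which is what the README example needs: the third one is read guo1 while the others are read guo4.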