andimarafioti / faster-qwen3-tts
Real-time text-to-speech with Qwen3-TTS
# Faster Qwen3-TTS

Real-time Qwen3-TTS inference using CUDA graph capture. No Flash Attention, no vLLM, no Triton. Just PyTorch. Supports both streaming and non-streaming generation.

## Install

Requires: Python 3.10+, PyTorch 2.5.1+, NVIDIA GPU with CUDA.

**PyTorch compatibility note:** CUDA-graph capture in the fast path is not reliable on earlier PyTorch releases for this project (capture can fail with "operation not permitted when stream is capturing"). We validated 2.5.1 as working and set it as the minimum supported version.

## Quick Start

The Python CLI supports voice cloning (reference audio), CustomVoice (predefined speaker IDs), VoiceDesign (instruction-based), streaming generation (prints RTF after each write), and a server mode that keeps the model hot between requests.

## Demo UI

A minimal web UI streams audio in real time and shows TTFA and RTF live. Features: voice cloning (upload any WAV or use your microphone), voice design (1.7B-VoiceDesign model), a streaming/non-streaming toggle, adjustable chunk size, live TTFA/RTF metrics, and WAV download.

## OpenAI-compatible API

The server exposes a `/v1/audio/speech` endpoint that follows the OpenAI TTS API contract, so it works out of the box with OpenWebUI, llama-swap, and any other OpenAI-compatible client. To expose multiple voices, pass a JSON file mapping names to reference-audio configs; each `voice` value in a request is routed to the matching entry. WAV and PCM formats stream chunks as they are generated; MP3 requires an additional encoding dependency.

## Results

Benchmarks include tokenization + inference (apples-to-apples with the baseline). RTF > 1.0 means faster than real-time. TTFA is measured as time to the first playable audio chunk using streaming (chunk_size=8).
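As a reference for the metrics used throughout, here is a minimal sketch of how RTF and TTFA can be computed. The helper names (`rtf`, `measure_ttfa`) are illustrative, not part of this repo:

```python
import time

def rtf(audio_seconds: float, wall_seconds: float) -> float:
    # Real-time factor: seconds of audio produced per second of wall time.
    # RTF > 1.0 means generation is faster than real time.
    return audio_seconds / wall_seconds

def measure_ttfa(chunk_iterator):
    # TTFA: wall time from request start to the first playable audio chunk.
    start = time.perf_counter()
    first_chunk = next(chunk_iterator)
    return time.perf_counter() - start, first_chunk

# 10 s of audio generated in 4 s of wall time -> RTF 2.5
print(rtf(10.0, 4.0))
```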
### 0.6B Model

| GPU | Baseline RTF | Baseline TTFA | CUDA Graphs RTF | CUDA Graphs TTFA | Speedup |
|---|---|---|---|---|---|
| Jetson AGX Orin 64GB | 0.179 | 3,641ms | 1.307 | 597ms | 7.3x / 6.1x |
| DGX Spark (GB10) | 1.17 | 567ms | 2.56 | 280ms | 2.2x / 2.0x |
| RTX 4090 | 0.82 | 800ms | **4.78** | **156ms** | 5.8x / 5.1x |
| RTX 4060 (Windows) | 0.23 | 2,697ms | **2.26** | **413ms** | 9.8x / 6.5x |
| H100 80GB HBM3 | 0.435 | 1,474ms | **3.884** | **228ms** | 8.9x / 6.5x |

### 1.7B Model

| GPU | Baseline RTF | Baseline TTFA | CUDA Graphs RTF | CUDA Graphs TTFA | Speedup |
|---|---|---|---|---|---|
| Jetson AGX Orin 64GB | 0.183 | 3,573ms | 1.089 | 693ms | 6.0x / 5.2x |
| DGX Spark (GB10) | 1.01 | 661ms | 1.87 | 400ms | 1.9x / 1.7x |
| RTX 4090 | 0.82 | 850ms | **4.22** | **174ms** | 5.1x / 4.9x |
| RTX 4060 (Windows) | 0.23 | 2,905ms | **1.83** | **460ms** | 7.9x / 6.3x |
| H100 80GB HBM3 | 0.439 | 1,525ms | **3.304** | **241ms** | 7.5x / 6.3x |

**Note:** Baseline TTFA values are **streaming TTFA** from the community fork (which adds streaming) or from our **dynamic-cache parity streaming** path (no CUDA graphs) where available. The official repo does **not** currently support streaming, so without a streaming baseline the TTFA would be **time-to-full-audio**. The CUDA-graphs numbers use streaming (chunk_size=8) for TTFA. Both include text tokenization for a fair comparison. Speedup shows throughput / TTFA improvement. The streaming fork reports additional speedups that appear tied to a dependency that is not available on Jetson-class devices, where we could not reproduce them.

**GPU architecture notes:** RTX 4090 (2.5 GHz clocks) outperforms H100 (1.8 GHz) for single-stream workloads. H100's lower baseline (RTF 0.435 vs the 4090's 0.82 on the 0.6B model) reflects design optimization for batch processing rather than single-stream inference.

### Benchmark your hardware

Benchmarks run from source on Linux / macOS / WSL and native Windows; you only need `uv`. Results and generated audio samples are saved to disk.
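The Speedup column is the ratio of the two RTFs (throughput) and of the two TTFAs (latency). As a sanity check, the Jetson 0.6B row can be reproduced with a few lines (a sketch, not repo code):

```python
# Jetson AGX Orin 64GB, 0.6B model (values from the table above).
baseline_rtf, graphs_rtf = 0.179, 1.307
baseline_ttfa_ms, graphs_ttfa_ms = 3641, 597

throughput_speedup = graphs_rtf / baseline_rtf    # higher RTF is better
ttfa_speedup = baseline_ttfa_ms / graphs_ttfa_ms  # lower TTFA is better

print(f"{throughput_speedup:.1f}x / {ttfa_speedup:.1f}x")  # -> 7.3x / 6.1x
```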
## Streaming

CUDA graphs support streaming output: audio chunks are yielded during generation with the same per-step performance as non-streaming mode.

### Chunk size vs performance (Jetson AGX Orin, 0.6B)

| chunk_size | TTFA | RTF | Audio per chunk |
|---|---|---|---|
| 1 | 240ms | 0.750 | 83ms |
| 2 | 266ms | 1.042 | 167ms |
| 4 | 362ms | 1.251 | 333ms |
| 8 | 556ms | 1.384 | 667ms |
| 12 | 753ms | 1.449 | 1000ms |
| Non-streaming | — | 1.57 | all at once |

Smaller chunks mean lower latency but more decode overhead. `chunk_size=2` is the smallest setting that stays real-time on Jetson.

**Mode speed:** All model modes are effectively the same speed. The first time you clone a voice it takes longer; afterwards the result is cached. Example on the 0.6B model:

| Mode | TTFA (ms) | RTF | ms/step |
| ---- | --------- | --- | ------- |
| VoiceClone xvec | 152 ± 11 | 5.470 ± 0.032 | 15.2 ± 0.1 |
| VoiceClone full ICL | 149 ± 1 | 5.497 ± 0.026 | 15.2 ± 0.1 |
| CustomVoice | 148 ± 1 | 5.537 ± 0.020 | 15.0 ± 0.1 |

### How streaming works

The CUDA graphs are unchanged: both the predictor and talker graphs are replayed per step. The streaming generator yields codec-ID chunks every chunk_size steps, and the model wrapper decodes each chunk to audio using a sliding window with 25 frames of left context (matching the upstream codec's pattern) to avoid boundary artifacts.

## Voice Cloning Quality

### Cloning modes

Two voice-cloning modes are exposed:

| Mode | Notes |
|---|---|
| Simple (x-vector) | Speaker embedding only: shorter prefill, clean language switching, no reference transcript needed |
| Advanced (ICL, default) | Full reference audio in context: requires an accurate reference transcript and may produce a brief artifact at the start, since it literally continues the sentence you use |

The default now matches upstream Qwen3-TTS: ICL mode with the reference audio in context. X-vector-only mode remains available as an opt-in for cleaner language switching and shorter prefills.
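The sliding-window decode described under "How streaming works" can be sketched as follows. `decode_frames`, the 24 kHz output rate, and the chunk contents are illustrative assumptions; the 12 Hz frame rate and the 25-frame left context come from the text above:

```python
SAMPLE_RATE = 24_000               # assumed codec output rate
FRAME_RATE = 12                    # 12 Hz codec: one frame is ~83 ms of audio
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE
LEFT_CONTEXT = 25                  # frames of prior context per decode call

def decode_frames(frames):
    # Stand-in for the codec decoder: codec frames -> audio samples.
    return [0.0] * (len(frames) * SAMPLES_PER_FRAME)

def stream_decode(frame_chunks):
    # Decode each chunk together with up to 25 frames of left context,
    # then trim the context's samples so only new audio is yielded.
    history = []
    for chunk in frame_chunks:
        context = history[-LEFT_CONTEXT:]
        audio = decode_frames(context + chunk)
        yield audio[len(context) * SAMPLES_PER_FRAME:]
        history.extend(chunk)

# chunk_size=2 -> each chunk yields 2 frames, i.e. ~167 ms of audio
chunk_lengths = [len(a) for a in stream_decode([[1, 2], [3, 4], [5, 6]])]
print(chunk_lengths)  # [4000, 4000, 4000]
```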
### Decoder context (ICL mode)

The 12 Hz codec uses a causal decoder: each frame is reconstructed using prior frames as acoustic context. In ICL mode, the reference audio's codec tokens are prepended to the generated tokens before decoding, and the reference portion is then trimmed from the output. Without this, the codec decoder starts cold with no voice context: the model generates the right tokens, but they are reconstructed in the wrong voice. This is handled automatically.

Text input streaming vs No…
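The prepend-and-trim handling from the decoder-context section above can be sketched like this; `codec_decode` and the sample rate are illustrative stand-ins, not the repo's actual API:

```python
SAMPLES_PER_FRAME = 2000  # assumed: 24 kHz audio / 12 Hz codec frames

def codec_decode(tokens):
    # Stand-in for the causal codec decoder: tokens -> audio samples.
    return [0.0] * (len(tokens) * SAMPLES_PER_FRAME)

def decode_with_reference(ref_tokens, gen_tokens):
    # Prepend the reference audio's codec tokens so the causal decoder
    # has the right voice context, then trim the reference audio away.
    audio = codec_decode(ref_tokens + gen_tokens)
    return audio[len(ref_tokens) * SAMPLES_PER_FRAME:]

audio = decode_with_reference(ref_tokens=[0] * 30, gen_tokens=[1] * 12)
print(len(audio))  # 12 generated frames * 2000 samples = 24000
```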