
herimor / voxtream

VoXtream is a full-stream zero-shot TTS model with extremely low latency and speaking-rate control

196 stars
23 forks
2 issues
Python · Dockerfile

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing herimor/voxtream in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/herimor/voxtream)

Repository Overview (README excerpt)


## VoXtream2: Full-stream TTS with dynamic speaking rate control

We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly.

### Key features

- **Dynamic speed control**: distribution matching and classifier-free guidance allow fine-grained speaking-rate control, which can be adjusted while the model generates speech.
- **Streaming performance**: runs **4x** faster than real time and achieves **74 ms** first-packet latency in full-stream mode on a consumer GPU.
- **Translingual capability**: prompt text masking enables support for acoustic prompts in any language.

### Updates

- We released VoXtream2.
- VoXtream was accepted for an oral presentation at ICASSP 2026.
- We released VoXtream. It is now available in the voxtream branch.

### Installation

- eSpeak NG phonemizer
- Pip package

### Usage

- **Prompt audio**: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio is trimmed).
- **Text**: what you want the model to say. The maximum supported length is 1000 characters (longer text is trimmed).
- **Speaking rate** (optional): target speaking rate in syllables per second.

**Notes**:

- The model was tested on Ubuntu 22.04 with CUDA 12 and PyTorch 2.4.
- The model requires 4.2 GB of VRAM (reducible to 2.2 GB by disabling the speech enhancement module).
- Maximum generation length is limited to 1 minute.
- The initial run may take a bit longer while model weights are downloaded and the model graph warms up.
- If you experience problems with CUDA Graphs, please check this issue.

### Command line

Output streaming

Full streaming (slow speech, 2 syllables per second)

### Python API

### Gradio demo

To start a Gradio web demo, run:

### Websocket

To start a websocket server, run:

To send a request to the server, run:

The client sends a path to the audio prompt and a text to the server, and immediately plays audio from the output stream.

### Evaluation

To reproduce the evaluation metrics from the paper, see the evaluation section.
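Since the speaking rate is specified in syllables per second, a rough syllable estimate for the input text tells you how long an utterance should take at a given rate. The sketch below is an illustration only, not part of VoXtream's API: `estimate_syllables` is a hypothetical helper using a naive vowel-group heuristic for English.

```python
import re

def estimate_syllables(text: str) -> int:
    """Rough English syllable estimate: count groups of consecutive vowels per word."""
    count = 0
    for word in re.findall(r"[a-zA-Z']+", text.lower()):
        groups = re.findall(r"[aeiouy]+", word)
        n = len(groups)
        # Heuristic: a trailing silent 'e' usually does not add a syllable.
        if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
            n -= 1
        count += max(n, 1)
    return count

text = "Hello world, this is a streaming text to speech demo."
syllables = estimate_syllables(text)
# At the paper's slow-speech setting of 2 syllables/s, this utterance
# should yield roughly syllables / 2 seconds of audio.
print(syllables, syllables / 2)
```

This is only a ballpark heuristic; for accurate counts a phonemizer such as eSpeak NG (already an installation dependency) would give syllable boundaries directly.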
### Training

- Build the Docker container. If you have another version of Docker Compose installed, use … instead.
- Run training using the script. You must specify the GPU IDs that will be visible inside the container, e.g. … . Set the batch size according to your GPU: the default batch size is 64 (tested on an H200); batch size 12 fits on an RTX 3090.
- The dataset is downloaded automatically to the HF cache directory. The dataset size is 80 GB. The data is loaded into RAM during training, so make sure you can allocate ~80 GB of RAM per GPU.
- Results are stored in the … directory.

Example of running training on 2 GPUs with batch size 32:

### Custom dataset

To prepare a custom training dataset, see the dataset section.

### Benchmark

To evaluate the model's real-time factor (RTF) and first-packet latency (FPL), run … . You can compile the model for faster inference using the … flag (note that the initial compilation takes some time).

| Device | Compiled | FPL, ms | RTF |
| :-: | :-: | :-: | :-: |
| RTX3090 | | 74 | 0.256 |
| RTX3090 | :heavy_check_mark: | 63 | 0.173 |

### TODO

- [ ] Add finetuning instructions

### License

The code in this repository is provided under the MIT License. The Depth Transformer component from SesameAI-CSM is included under the Apache 2.0 License (see LICENSE-APACHE and NOTICE). The model weights were trained on data licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Redistribution of the weights must include proper attribution to the original dataset creators (see ATTRIBUTION.md).

### Acknowledgements

- Mimi: streaming audio codec from Kyutai
- CSM: conversational speech model from Sesame
- ReDimNet: speaker recognition model from IDR&D
- Sidon: speech enhancement model from SaruLab

### Citation

### Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may put you in violation of copyright law.
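To make the benchmark numbers concrete: RTF is synthesis time divided by the duration of the generated audio, so an RTF below 1 means faster than real time. A small sanity-check sketch using the uncompiled RTX 3090 figures from the table (the `rtf` helper name is ours, not from the repository):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: compute time spent per second of generated audio."""
    return synthesis_seconds / audio_seconds

# Table figures (RTX 3090, uncompiled): RTF 0.256 means producing 10 s of
# speech takes about 2.56 s of compute, i.e. the model runs roughly
# 1 / 0.256 ≈ 3.9x faster than real time, consistent with the "4x" claim.
audio_s = 10.0
compute_s = 0.256 * audio_s
print(compute_s)       # ≈ 2.56 s of compute for 10 s of audio
print(1 / 0.256)       # ≈ 3.9x real-time speedup
```

FPL is independent of RTF: it measures only how long you wait for the first audio packet (74 ms uncompiled, 63 ms compiled), which is what matters for perceived responsiveness in streaming use.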