
Tobertz-max / DiFlow-TTS

DiFlow-TTS delivers low-latency zero-shot TTS via discrete flow matching and factorized speech tokens. A compact, open framework for fast voice synthesis.🐙

View on GitHub
53 stars
5 forks
0 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing Tobertz-max/DiFlow-TTS in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/Tobertz-max/DiFlow-TTS)

Repository Overview (README excerpt)


DiFlow-TTS: Fast Zero-Shot TTS with Discrete Flow Matching

DiFlow-TTS delivers low-latency, zero-shot text-to-speech through discrete flow matching and factorized speech tokens. It combines a compact token representation with a flow-based sampler to produce natural speech quickly, even for unseen speakers and languages. This README covers what the project is, how it works, how to use it, and how to contribute. For quick access to build artifacts and binaries, visit the Releases page linked above.

Table of contents

• What is DiFlow-TTS?
• Why DiFlow-TTS matters
• Key ideas and design goals
• Architecture overview
• Tokenization and factorized speech tokens
• Inference and training workflow
• Getting started
• Quick start: run a demo
• Binaries and releases
• Data, evaluation, and benchmarks
• API reference
• Advanced usage
• Deployment scenarios
• Troubleshooting
• Development and contribution
• Licensing and credits
• Roadmap
• FAQ
• Community and support

What is DiFlow-TTS?

DiFlow-TTS is a text-to-speech system built to be fast, flexible, and robust in zero-shot scenarios. It uses discrete flow matching to align linguistic features with speech tokens in a compact, factorized form. The approach enables low-latency synthesis and easy adaptation to new voices and languages without large amounts of labeled data.

Why DiFlow-TTS matters

• Low latency: The system is designed to produce speech in real time, suitable for interactive applications, voice assistants, and captions in streaming services.
• Zero-shot capability: It handles new speakers and accents without requiring extensive fine-tuning data.
• Efficient representations: Factorized speech tokens reduce memory usage and improve inference speed while preserving natural prosody.
• Clear separation of concerns: A modular design lets researchers plug in better tokenizers, better vocoders, or alternative flows without reworking the entire pipeline.
Key ideas and design goals

• Discrete flow matching: A flow-based mechanism that maps linguistic representations to discrete speech tokens with stable, invertible transformations.
• Factorized tokens: Speech tokens are split into independent factors (e.g., phonetic content, prosody cues, speaker style) to improve generalization.
• Latency awareness: The model architecture prioritizes streaming-friendly operations and parallelizable components.
• Robust zero-shot: The training regime emphasizes broad phonetic coverage and style diversity to support zero-shot synthesis.
• Accessibility: A clean, well-documented interface with reasonable defaults to help newcomers and experienced researchers alike.

Architecture overview

• Text frontend: Converts input text into a linguistic representation. This stage handles tokenization, normalization, and phoneme/grapheme mapping.
• Discrete flow matcher: The core of the model. It learns to map linguistic representations into discrete speech tokens via a flow-based mechanism. The flow is designed to be invertible, making it efficient for training and enabling controllable synthesis.
• Token encoder/decoder: Works with a factorized token representation. The encoder transforms linguistic signals into a compact set of discrete tokens; the decoder reconstructs the speech waveform or spectrogram conditioned on those tokens.
• Vocoder or neural synthesizer: Converts the generated tokens into a waveform. A lightweight neural vocoder emphasizes speed while preserving naturalness.
• Post-processing: Optional steps for waveform smoothing, noise suppression, and energy normalization to ensure consistent audio output.

Tokenization and factorized speech tokens

• Factorization: Speech tokens are split into subcomponents such as content tokens, prosody tokens, and speaker/style tokens. This separation helps the model generalize to new voices and speaking styles.
• Discrete tokens: Using discrete tokens simplifies the learning problem and makes the flow operations more stable. It also enables compact models with fast decoding.
• Prosody control: The design includes explicit control over duration, pitch, and energy, enabling expressive speech while preserving natural rhythm.
• Speaker and language adaptation: By decoupling content from speaker attributes, the model can adapt to new voices with limited data.

Inference and training workflow

• Training loop: The model learns to predict the next discrete token given the prior tokens and the linguistic context. The loss combines a reconstruction term for tokens and a teacher-forcing or sampling objective for stability.
• Inference loop: Given text input, the model predicts a sequence of tokens and then synthesizes audio via the vocoder. Latent flows are resolved deterministically for speed.
• Zero-shot strategy: A diverse training corpus, plus explicit alignment of token factors, helps the system generalize to unseen voices and languages.
• Real-time considerations: The pipeline is designed to minimize sequential dependencies, enabling streaming inference with low latency.
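The factorized token layout and the iterative, non-sequential sampling described above can be sketched in a few lines. This is a toy illustration, not the DiFlow-TTS API: the names `FactorizedTokens` and `toy_discrete_flow_sample` are assumptions, and the random fills stand in for what would be predictions from the trained flow-matching model.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical illustration; these names are NOT from the DiFlow-TTS codebase.
# Factorized speech tokens keep content, prosody, and speaker/style streams
# as separate discrete sequences, as described above.

@dataclass
class FactorizedTokens:
    content: list[int]   # phonetic content tokens
    prosody: list[int]   # duration/pitch/energy cues
    speaker: list[int]   # speaker/style tokens

MASK = -1  # placeholder id for positions not yet resolved

def toy_discrete_flow_sample(length: int, vocab_size: int, steps: int = 4,
                             rng: random.Random | None = None) -> list[int]:
    """Toy sketch of iterative discrete sampling: start from all-masked
    tokens and resolve a fraction of positions at each step, rather than
    decoding strictly left-to-right. A real discrete flow-matching sampler
    would fill each position from a learned model; here we fill at random
    purely to show the control flow."""
    rng = rng or random.Random(0)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Resolve an evenly sized slice of the remaining masked positions.
        k = max(1, len(masked) // (steps - step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = rng.randrange(vocab_size)  # model prediction goes here
    # Resolve any leftovers in a final pass.
    return [t if t != MASK else rng.randrange(vocab_size) for t in tokens]

tokens = FactorizedTokens(
    content=toy_discrete_flow_sample(8, vocab_size=256),
    prosody=toy_discrete_flow_sample(8, vocab_size=64),
    speaker=toy_discrete_flow_sample(1, vocab_size=512),
)
```

Because whole batches of positions are resolved per step, this style of sampler has few sequential dependencies, which is what makes the low-latency, streaming-friendly inference described above plausible.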
Getting started

System requirements

• Linux or macOS
• Python 3.8–3.11
• A CUDA-capable GPU (optional but recommended for speed)
• Adequate RAM (16 GB or more recommended for training)

Dependencies

• PyTorch (stable release)
• torchaudio
• numpy, scipy
• transformers or a lightweight tokenizer (if using a BPE or phoneme-based frontend)
• librosa or an equivalent audio utility library for audio processing

Environment setup (example)

• Create a virtual environment
• Install dependencies from a requirements file
• Prepare a small dataset for quick testing

Data preparation

• Text data: clean, normalized, and tokenized
• Speech data: aligned audio with reference transcripts
• Optional speaker IDs or style descriptors for conditioning

Training and evaluation scripts

• A training script that supports multi-GPU setups
• An evaluation script that computes MOS, STOI, PESQ, and other quality metrics
• A script to run a quick, in-browser demo or streaming demo…
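As a minimal sketch of the "clean, normalized, and tokenized" text-preparation step above, the snippet below shows a generic normalizer and word-to-id tokenizer using only the standard library. It is illustrative only; the frontend actually shipped with DiFlow-TTS (BPE or phoneme-based) may behave differently.

```python
import re
import unicodedata

# Generic text-preparation sketch, not the DiFlow-TTS frontend.

def normalize_text(text: str) -> str:
    """Strip accents via NFKD decomposition, lowercase, drop stray
    punctuation, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop stray punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map normalized words to integer ids, reserving 0 for
    out-of-vocabulary words."""
    return [vocab.get(word, 0) for word in normalize_text(text).split()]

vocab = {"hello": 1, "world": 2}
print(tokenize("Héllo,  WORLD!!", vocab))  # -> [1, 2]
```

A real frontend would also expand numbers and abbreviations and map words to phonemes, but the shape of the step (raw text in, integer ids out) is the same.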