
pnnbao97 / VieNeu-TTS

Vietnamese TTS with instant voice cloning • On-device • Real-time CPU inference • 24 kHz audio quality • Vietnamese text-to-speech

View on GitHub
943 stars
313 forks
7 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing pnnbao97/VieNeu-TTS in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/pnnbao97/VieNeu-TTS)

Repository Overview (README excerpt)


# 🦜 VieNeu-TTS

**VieNeu-TTS** is an advanced on-device Vietnamese Text-to-Speech (TTS) model with **instant voice cloning**.

> [!TIP]
> **Voice Cloning:** All model variants (including GGUF) support instant voice cloning with just **3-5 seconds** of reference audio.

This project features two core architectures trained on the VieNeu-TTS-1000h dataset:

- **VieNeu-TTS (0.5B):** An enhanced model optimized for maximum stability.
- **VieNeu-TTS-0.3B:** A specialized model **trained from scratch** on the VieNeu-TTS-1000h dataset, delivering 2x faster inference and ultra-low latency.

These represent a significant upgrade with the following improvements:

- **Enhanced pronunciation:** More accurate and stable Vietnamese pronunciation, powered by the sea-g2p library
- **Code-switching support:** Seamless transitions between Vietnamese and English, powered by the sea-g2p library
- **Better voice cloning:** Higher fidelity and speaker consistency
- **Real-time synthesis:** 24 kHz waveform generation on CPU or GPU
- **Multiple model formats:** PyTorch, GGUF Q4/Q8 (CPU optimized), and an ONNX codec

VieNeu-TTS delivers production-ready speech synthesis fully offline.

**Author:** Phạm Nguyễn Ngọc Bảo

---

## 📌 Table of Contents

- 🦜 Installation & Web UI
- 📦 Using the Python SDK
- 🐳 Docker & Remote Server
- 🎯 Custom Models
- 🛠️ Fine-tuning Guide
- 🔬 Model Overview
- 🐋 Deployment with Docker (Compose)
- 🚀 Roadmap
- 🤝 Support & Contact

---

## 🦜 1. Installation & Web UI

> *Intel Arc GPU users: read the Intel Arc GPU section below.*

### Installation Steps

1. **Clone the repo.**
2. **Set up the environment with uv (recommended):**
   - **Step A:** Install uv (if you haven't already).
   - **Step B:** Install dependencies:
     - **Option 1:** GPU support (default)
     - **Option 2:** CPU-only (lightweight, no CUDA)
3. **Start the Web UI** and access it in your browser.

### ⚡ Real-time Streaming (CPU Optimized)

VieNeu-TTS supports **ultra-low latency streaming**, allowing audio playback to start before the entire sentence is finished.
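The chunked-streaming idea can be sketched generically. The vieneu streaming API itself is not shown in this excerpt, so the snippet below uses a dummy synthesizer; the chunk size, duration heuristic, and function name are illustrative assumptions, and only the yield-chunks-as-they-are-ready pattern is the point.

```python
import math

SAMPLE_RATE = 24_000  # VieNeu-TTS generates 24 kHz audio

def synthesize_stream(text, chunk_ms=200):
    """Dummy stand-in for a streaming TTS backend: yields small audio
    chunks as soon as each is 'ready', instead of one final buffer.
    The length heuristic below is fake; real duration depends on the model."""
    total_samples = len(text) * 1_200            # fake 'synthesis' length
    chunk = SAMPLE_RATE * chunk_ms // 1000       # samples per ~200 ms chunk
    samples = [0.1 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
               for n in range(total_samples)]
    for start in range(0, total_samples, chunk):
        yield samples[start:start + chunk]

received = []
for block in synthesize_stream("Xin chào"):
    # A real player would write `block` to the sound device here, so
    # playback starts after the first chunk rather than after the full text.
    received.append(block)

total = sum(len(b) for b in received)
print(total, "samples in", len(received), "chunks")
```

The consumer loop is where the latency win comes from: the first ~200 ms of audio is playable while the rest is still being synthesized.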
This is specifically optimized for **CPU-only** devices using the GGUF backend.

- **Latency:**

### Intel Arc GPU Users: Installation Guide

1. **Clone the repo.**
2. **Set up the environment and dependencies with uv (recommended):** run `setup_xpu_uv.bat`
3. **Start the Web UI:** run `run_xpu.bat`, then access the UI in your browser.

---

## 📦 2. Using the Python SDK (vieneu)

Integrate VieNeu-TTS into your own software projects.

### Quick Install

### Quick Start (main.py)

*For full implementation details, see examples/main.py.*

---

## 🐳 3. Docker & Remote Server

Deploy VieNeu-TTS as a high-performance API server (powered by LMDeploy) with a single command.

### Run with Docker (Recommended)

**Requirement:** The NVIDIA Container Toolkit is required for GPU support.

**Start the server with a public tunnel (no port forwarding needed):**

- **Default:** The server loads the model for maximum quality.
- **Tunneling:** The Docker image includes a built-in tunnel; check the container logs to find your public address.

### Using the SDK (Remote Mode)

Once the server is running, you can connect from anywhere (Colab, web apps, etc.) without loading heavy models locally.

*For full implementation details, see examples/main_remote.py.*

### Voice Preset Specification (v1.0)

VieNeu-TTS uses the official specification to define reusable voice assets. Only files following this spec are guaranteed to be compatible with VieNeu-TTS SDK ≥ v1.x.

### Advanced Configuration

Customize the server to run specific versions or your own fine-tuned models.

- **Run the 0.3B model (faster).**
- **Serve a local fine-tuned model:** if you have merged a LoRA adapter, mount your output directory into the container.

*For full implementation details, see examples/main_remote.py.*

---

## 🎯 4. Custom Models (LoRA, GGUF, Finetune)

VieNeu-TTS lets you load custom models directly from Hugging Face or local paths via the Web UI.

*👉 See the detailed guide: **docs/CUSTOM_MODEL_USAGE.md***

---
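Whether you clone via the Web UI, a voice preset, or a custom model, the tip at the top of this README says instant cloning needs **3-5 seconds** of reference audio, and the model outputs 24 kHz audio. A quick sanity check on a reference clip might look like the sketch below; the helper name and error message are illustrative, not part of the vieneu SDK.

```python
SAMPLE_RATE = 24_000                         # VieNeu-TTS audio rate (from this README)
MIN_REF_SECONDS, MAX_REF_SECONDS = 3.0, 5.0  # per the voice-cloning tip above

def check_reference(num_samples, rate=SAMPLE_RATE):
    """Return the clip length in seconds if it suits instant voice cloning,
    otherwise raise. Purely illustrative; the SDK may do its own validation."""
    seconds = num_samples / rate
    if not (MIN_REF_SECONDS <= seconds <= MAX_REF_SECONDS):
        raise ValueError(f"reference clip is {seconds:.2f}s; aim for 3-5s")
    return seconds

print(check_reference(4 * SAMPLE_RATE))  # a 4-second clip passes: 4.0
```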
## 🛠️ 5. Fine-tuning Guide

Train VieNeu-TTS on your own voice or custom datasets.

- **Simple workflow:** Use the provided script with optimized LoRA configurations.
- **Documentation:** Follow the step-by-step guide in **finetune/README.md**.
- **Notebook:** Try it directly on Google Colab.

---

## 🔬 6. Model Overview (Backbones)

| Model | Format | Device | Quality | Speed |
| --- | --- | --- | --- | --- |
| VieNeu-TTS | PyTorch | GPU/CPU | ⭐⭐⭐⭐⭐ | Very fast with LMDeploy |
| VieNeu-TTS-0.3B | PyTorch | GPU/CPU | ⭐⭐⭐⭐ | **Ultra fast (2x)** |
| VieNeu-TTS-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Fast |
| VieNeu-TTS-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Very fast |
| VieNeu-TTS-0.3B-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | **Ultra fast (1.5x)** |
| VieNeu-TTS-0.3B-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | **Extreme speed (2x)** |

### 🔬 Model Details

- **Training data:** VieNeu-TTS-1000h (443,641 curated Vietnamese samples; used for all versions).
- **Audio codec:** NeuCodec (Torch implementation; ONNX & quantized variants supported).
- **Context window:** 2,048 tokens, shared by prompt text and speech tokens.
- **Output watermark:** Enabled by default.

---

## 🐋 7. Deployment with Docker (Compose)

Deploy quickly without manual environment setup.

> **Note:** Docker deployment currently supports **GPU only**. For CPU usage, follow the Installation & Web UI section and install from source.

Check docs/Deploy.md for more details.

---

## 📚 References

- **Dataset:** VieNeu-TTS-1000h (Hugging Face)
- **Model 0.5B:** pnnbao-ump/VieNeu-TTS
- **Model 0.3B:** pnnbao-ump/VieNeu-TTS-0.3B
- **LoRA guide:** docs/CUSTOM_MODEL_USAGE.md

---

## 🚀 Roadmap

We are constantly working to improve VieNeu-TTS. Here is what we have planned:

- [ ] **🦜 VieNeu-TTS 2.0:** Upcoming version featuring superior voice cloning fidelity and improved handling of long-context text synthesis.
- [ ] **🔊 VieNeu-Codec:** Development of a custom neural audio codec specifically optim…
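One practical consequence of the 2,048-token shared context window noted in the Model Details is that longer text prompts leave fewer tokens for speech. A back-of-envelope helper is sketched below; the speech-token rate of 50 tokens/s is an assumption for illustration (the real NeuCodec rate is not stated in this excerpt), so treat the numbers as rough bounds, not specs.

```python
CONTEXT_WINDOW = 2_048   # tokens shared by prompt text and speech tokens (from Model Details)
TOKENS_PER_SECOND = 50   # assumed speech-token rate; not given in this README excerpt

def max_audio_seconds(prompt_tokens):
    """Upper bound on synthesizable audio once the text prompt has
    consumed its share of the shared context window."""
    budget = CONTEXT_WINDOW - prompt_tokens
    if budget <= 0:
        raise ValueError("prompt alone fills the context window")
    return budget / TOKENS_PER_SECOND

print(max_audio_seconds(48))  # a 48-token prompt leaves 40.0 s under these assumptions
```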