
QuentinFuxa / WhisperLiveKit

Simultaneous speech-to-text model

9,812 stars
997 forks
27 issues
Python · JavaScript · CSS



WhisperLiveKit: Ultra-low-latency, self-hosted speech-to-text with speaker identification

WhisperLiveKit Demo


Powered by Leading Research:

See the interactive playground in this repo to explore how AlignAtt works

Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
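The intelligent buffering idea can be sketched in a few lines. The LocalAgreement policy (one of the streaming policies supported below) commits only the tokens that two consecutive decoding passes over the growing audio buffer agree on, so unstable words at the buffer's edge are never emitted prematurely. This is a toy illustration, not the project's actual implementation:

```python
def committed_prefix(prev_hyp: list[str], curr_hyp: list[str]) -> list[str]:
    """Return the longest common prefix of two successive hypotheses.

    LocalAgreement-2 commits only tokens that the last two decoding
    passes over the (growing) audio buffer agree on; everything after
    the first disagreement stays uncommitted until more audio arrives.
    """
    prefix = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        prefix.append(a)
    return prefix

# Successive hypotheses as the audio buffer grows:
h1 = "the quick brown fo".split()       # "fox" is cut off mid-word
h2 = "the quick brown fox jumps".split()
print(committed_prefix(h1, h2))  # → ['the', 'quick', 'brown']
```

The half-word "fo" never reaches the user: it only gets committed once a later pass confirms it as "fox".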

Architecture


The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
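The gating idea can be illustrated with a toy energy-based detector. WhisperLiveKit uses a trained VAD model rather than raw energy, so this sketch only shows the principle: silent frames are identified cheaply and never reach the expensive ASR step.

```python
import math

def is_speech(frame: list[float], threshold: float = 0.01) -> bool:
    """Toy energy-based voice activity check on one audio frame.

    RMS energy is only a rough proxy for a real VAD model, but it
    shows how silent frames can be skipped before any ASR inference
    runs, which is what keeps idle connections cheap.
    """
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold

silence = [0.0] * 160  # one 10 ms frame at 16 kHz
tone = [0.1 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
print(is_speech(silence), is_speech(tone))  # → False True
```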

Installation & Quick Start

pip install whisperlivekit

You can also clone the repo and run pip install -e . to get the latest version.

Quick Start

  1. Start the transcription server:

    wlk --model base --language en
    
  2. Open your browser and navigate to http://localhost:8000. Start speaking and watch your words appear in real-time!

  • See here for the list of all available languages.
  • Check the troubleshooting guide for step-by-step fixes collected from recent GPU setup/env issues.
  • The CLI entry point is exposed as both wlk and whisperlivekit-server; they are equivalent.
  • For HTTPS requirements, see the Parameters section for SSL configuration options.

Browser extension

Use it to capture audio directly from web pages. Go to the chrome-extension directory for instructions.

WhisperLiveKit Demo

Optional Dependencies

| Feature | uv sync | pip install -e |
|---|---|---|
| Apple Silicon MLX Whisper backend | `uv sync --extra mlx-whisper` | `pip install -e ".[mlx-whisper]"` |
| Voxtral (MLX backend, Apple Silicon) | `uv sync --extra voxtral-mlx` | `pip install -e ".[voxtral-mlx]"` |
| CPU PyTorch stack | `uv sync --extra cpu` | `pip install -e ".[cpu]"` |
| CUDA 12.9 PyTorch stack | `uv sync --extra cu129` | `pip install -e ".[cu129]"` |
| Translation | `uv sync --extra translation` | `pip install -e ".[translation]"` |
| Sentence tokenizer | `uv sync --extra sentence_tokenizer` | `pip install -e ".[sentence_tokenizer]"` |
| Voxtral (HF backend) | `uv sync --extra voxtral-hf` | `pip install -e ".[voxtral-hf]"` |
| Speaker diarization (Sortformer / NeMo) | `uv sync --extra diarization-sortformer` | `pip install -e ".[diarization-sortformer]"` |
| Speaker diarization with Diart (not recommended) | `uv sync --extra diarization-diart` | `pip install -e ".[diarization-diart]"` |

Supported GPU profiles:

# Profile A: Sortformer diarization
uv sync --extra cu129 --extra diarization-sortformer

# Profile B: Voxtral HF + translation
uv sync --extra cu129 --extra voxtral-hf --extra translation

voxtral-hf and diarization-sortformer are intentionally incompatible extras and must be installed in separate environments.

See Parameters & Configuration below for how to use them.

Speed vs Accuracy tradeoff

See BENCHMARK.md for the full benchmark with tables, model size comparison, and more. We are actively looking for benchmark results on other hardware (NVIDIA GPUs, different Apple Silicon chips, cloud instances). If you run the benchmarks on your machine, please share your results via an issue or PR!

Voxtral Backend

WhisperLiveKit supports Voxtral Mini, a 4B-parameter speech model from Mistral AI that natively handles 100+ languages with automatic language detection. Whisper also supports auto-detection (--language auto), but Voxtral's per-chunk detection is more reliable and does not bias towards English.

# Apple Silicon (native MLX, recommended)
pip install -e ".[voxtral-mlx]"
wlk --backend voxtral-mlx

# Linux/GPU (HuggingFace transformers)
pip install transformers torch
wlk --backend voxtral

Voxtral uses its own streaming policy and does not use LocalAgreement or SimulStreaming. See BENCHMARK.md for performance numbers.

Usage Examples

Command-line Interface: Start the transcription server with various options:

# Large model, translating from French to Danish
wlk --model large-v3 --language fr --target-language da

# Diarization and server listening on */80
wlk --host 0.0.0.0 --port 80 --model medium --diarization --language fr

# Voxtral multilingual (auto-detects language)
wlk --backend voxtral-mlx

Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

from whisperlivekit import AudioProcessor, TranscriptionEngine

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the (heavy) model once at startup and share it across connections.
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        # Stop forwarding results once the client goes away.
        results_task.cancel()

Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also import it using from whisperlivekit import get_inline_ui_html and page = get_inline_ui_html()

Parameters & Configuration

| Parameter | Description | Default |
|---|---|---|
| `--model` | Whisper model size. List and recommendations here | `small` |
| `--model-path` | Local .pt file/directory or Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations here | `None` |
| `--language` | List here. With `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If set, translates using NLLB. 200 languages available. To translate to English, you can also use `--direct-english-translation`: the STT model will try to output the translation directly. | `None` |
| `--diarization` | Enable speaker identification | `False` |
| `--backend-policy` | Streaming strategy: `simulstreaming` uses AlignAtt SimulStreaming, `localagreement` uses the LocalAgreement policy | `simulstreaming` |
| `--backend` | ASR backend selector. `auto` picks MLX on macOS (if installed), otherwise Faster-Whisper, otherwise vanilla Whisper. Options: `mlx-whisper`, `faster-whisper`, `whisper`, `openai-api` (LocalAgreement only), `voxtral-mlx` (Apple Silicon), `voxtral` (HuggingFace) | `auto` |
| `--no-vac` | Disable the Voice Activity Controller. NOT ADVISED | `False` |
| `--no-vad` | Disable Voice Activity Detection. NOT ADVISED | `False` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--forwarded-allow-ips` | IP(s) allowed to reverse-proxy the whisperlivekit-server. Accepts IP addresses (e.g. 127.0.0.1), IP networks (e.g. 10.100.0.0/16), or literals (e.g. /path/to/socket.sock) | `None` |
| `--pcm-input` | Expect raw PCM (s16le) input and bypass FFmpeg. The frontend then uses AudioWorklet instead of MediaRecorder | `False` |
| `--lora-path` | Path or Hugging Face repo ID for LoRA adapter weights (e.g. qfuxa/whisper-base-french-lora). Only works with the native Whisper backend (`--backend whisper`) | `None` |

| Translation options | Description | Default |
|---|---|---|
| `--nllb-backend` | `transformers` or `ctranslate2` | `ctranslate2` |
| `--nllb-size` | `600M` or `1.3B` | `600M` |

| Diarization options | Description | Default |
|---|---|---|
| `--diarization-backend` | `diart` or `sortformer` | `sortformer` |
| `--disable-punctuation-split` | [NOT FUNCTIONAL IN 0.2.15 / 0.2.16] Disable punctuation-based splits. See #214 | `False` |
| `--segmentation-model` | Hugging Face model ID for the Diart segmentation model. Available models | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for the Diart embedding model. Available models | `speechbrain/spkrec-ecapa-voxceleb` |

| SimulStreaming backend options | Description | Default |
|---|---|---|
| `--disable-fast-encoder` | Disable the Faster-Whisper or MLX Whisper encoder backends (if installed). Inference can be slower, but this helps when GPU memory is limited | `False` |
| `--custom-alignment-heads` | Use your own alignment heads, useful when `--model-dir` is used. Use scripts/determine_alignment_heads.py to extract them | `None` |
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |
| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |
| `--cif-ckpt-path` | Path to a CIF model for word boundary detection | `None` |
| `--never-fire` | Never truncate incomplete words | `False` |
| `--init-prompt` | Initial prompt for the model | `None` |
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
| `--max-context-tokens` | Maximum context tokens | Depends on the model used, usually 448 |

| WhisperStreaming backend options | Description | Default |
|---|---|---|
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
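To make the --frame-threshold trade-off concrete: AlignAtt halts the current decoding pass once the decoder's cross-attention focuses within the last few frames of the audio buffer, i.e. on audio that may still change as more arrives. The following is a simplified sketch of that stopping rule under assumed behavior, not the project's actual implementation:

```python
def should_stop(attention_weights: list[float], total_frames: int,
                frame_threshold: int) -> bool:
    """AlignAtt-style stopping rule (simplified sketch).

    Halt decoding when the most-attended audio frame lies within
    `frame_threshold` frames of the end of the buffer. A low threshold
    rarely halts, so more tokens are emitted per pass (faster, riskier);
    a high threshold halts as soon as attention nears the buffer end
    (more latency, more accurate).
    """
    most_attended = max(range(total_frames), key=lambda i: attention_weights[i])
    return most_attended >= total_frames - frame_threshold

# Attention peaked at frame 90 of a 100-frame buffer:
weights = [0.0] * 100
weights[90] = 1.0
print(should_stop(weights, 100, 25))  # → True  (90 >= 75: halt, wait for audio)
print(should_stop(weights, 100, 5))   # → False (90 < 95: keep emitting tokens)
```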

For diarization using Diart, you need to accept the user conditions here for the pyannote/segmentation model, here for the pyannote/segmentation-3.0 model, and here for the pyannote/embedding model. Then log in to Hugging Face: huggingface-cli login

🚀 Deployment Guide

To deploy WhisperLiveKit in production:

  1. Server Setup: Install production ASGI server & launch with multiple workers

    pip install uvicorn gunicorn
    gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
    
  2. Frontend: Host your customized version of the html example & ensure WebSocket connection points correctly

  3. Nginx Configuration (recommended for production):

    server {
        listen 80;
        server_name your-domain.com;

        location / {
            proxy_pass http://localhost:8000;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
        }
    }
    
  4. HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL
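A page served over HTTPS must open its WebSocket with wss://, otherwise browsers block it as mixed content. A minimal sketch of deriving the URL (the /asr path matches the endpoint used elsewhere in this README; host and port are placeholders):

```python
def websocket_url(host: str, port: int, use_tls: bool, path: str = "/asr") -> str:
    """Build the WebSocket URL for the transcription endpoint.

    Behind an HTTPS page the connection must use wss://; behind plain
    HTTP, ws:// is fine.
    """
    scheme = "wss" if use_tls else "ws"
    return f"{scheme}://{host}:{port}{path}"

print(websocket_url("localhost", 8000, False))      # → ws://localhost:8000/asr
print(websocket_url("your-domain.com", 443, True))  # → wss://your-domain.com:443/asr
```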

🐋 Docker

Deploy the application easily using Docker with GPU or CPU support.

Prerequisites

  • Docker installed on your system
  • For GPU support: NVIDIA Docker runtime installed

Quick Start

With GPU acceleration (recommended):

docker build -t wlk .
docker run --gpus all -p 8000:8000 --name wlk wlk

CPU only:

docker build -f Dockerfile.cpu -t wlk --build-arg EXTRAS="cpu" .
docker run -p 8000:8000 --name wlk wlk

Advanced Usage

Custom configuration:

# Example with custom model and language
docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr

Compose (recommended for cache + token wiring):

# GPU Sortformer profile
docker compose up --build wlk-gpu-sortformer

# GPU Voxtral profile
docker compose up --build wlk-gpu-voxtral

# CPU service
docker compose up --build wlk-cpu

Memory Requirements

  • Large models: Ensure your Docker runtime has sufficient memory allocated

Customization

  • --build-arg Options:
    • EXTRAS="cu129,diarization-sortformer" - GPU Sortformer profile extras.
    • EXTRAS="cu129,voxtral-hf,translation" - GPU Voxtral profile extras.
    • EXTRAS="cpu,diarization-diart,translation" - CPU profile extras.
    • Hugging Face cache + token are configured in compose.yml using a named volume and HF_TKN_FILE (default: ./token).

Testing & Benchmarks

WhisperLiveKit includes a unit test suite and an offline benchmark harness.

# Install test dependencies
pip install -e ".[test]"

# Run unit tests (no model download required)
pytest tests/ -v

# Benchmark a single backend
python test_backend_offline.py --backend faster-whisper --no-realtime

# Benchmark all installed backends
python test_backend_offline.py --benchmark --no-realtime

# Export benchmark results as JSON
python test_backend_offline.py --benchmark --no-realtime --json results.json

See BENCHMARK.md for a full comparison of backends, policies, WER, speed, and timestamp accuracy on Apple Silicon.
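As an illustration of how exported results can be post-processed, here is a sketch that computes a real-time factor per backend. The JSON field names used here (backend, audio_seconds, processing_seconds) are hypothetical; check the actual output of test_backend_offline.py for the real schema:

```python
import json

def realtime_factors(results_json: str) -> dict[str, float]:
    """Compute processing-time / audio-duration per backend.

    RTF < 1.0 means the backend transcribes faster than real time.
    Field names are illustrative, not the harness's real schema.
    """
    rows = json.loads(results_json)
    return {r["backend"]: r["processing_seconds"] / r["audio_seconds"] for r in rows}

sample = json.dumps([
    {"backend": "faster-whisper", "audio_seconds": 60.0, "processing_seconds": 12.0},
    {"backend": "mlx-whisper", "audio_seconds": 60.0, "processing_seconds": 9.0},
])
print(realtime_factors(sample))  # → {'faster-whisper': 0.2, 'mlx-whisper': 0.15}
```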

Use Cases

  • Meeting transcription: capture discussions in real time
  • Accessibility tools: help hearing-impaired users follow conversations
  • Content creation: transcribe podcasts or videos automatically
  • Customer service: transcribe support calls with speaker identification