MatthewCYM / VoiceBench
VoiceBench: Benchmarking LLM-Based Voice Assistants
Repository Overview (README excerpt)
🏆 Leaderboard | 📄 Paper | 🤗 Data

> We encourage new result submissions through the issue tracker. The leaderboard will be updated accordingly.

## News

- Released , a crowd-sourced dataset comprising human-recorded speech with diverse accents.
- Released , a crowd-sourced dataset comprising human-recorded speech, for evaluating the reasoning ability of voice assistants.
- Updated the VoiceBench Leaderboard to include .
- Added a curated list of awesome voice assistants.
- Expanded the test samples in VoiceBench to include , covering 12 diverse domains from .
- Updated the VoiceBench Leaderboard to include: 1) Mini-Omni2, GPT-4o-Audio, and Whisper-v3+GPT-4o, and 2) multiple-choice QA from OpenBookQA.
- Expanded the test samples in VoiceBench to include: 1) the complete set of open-ended QA from , and 2) multiple-choice QA from .

## Table of Contents

- **Setup**
- **Dataset**
- **Evaluation**
- **Awesome Voice Assistants**
- **Citation**

## Setup

## Dataset

The data used in this project is available in the VoiceBench Dataset hosted on Hugging Face. You can access it directly via the link and integrate it into your project using the Hugging Face library.
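The original loading snippet did not survive extraction. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset id `hlt-lab/voicebench` and the helper function are assumptions, so verify the id against the Hugging Face link above before use:

```python
# Subsets listed in the Available Data table below.
SUBSETS = {"alpacaeval", "alpacaeval_full", "commoneval", "wildvoice",
           "openbookqa", "mmsu", "sd-qa", "mtbench", "ifeval", "bbh",
           "advbench"}

def load_subset(name: str, split: str = "test"):
    """Load one VoiceBench subset. The dataset id is an assumption --
    substitute the id from the linked Hugging Face dataset page."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset: {name!r}")
    from datasets import load_dataset  # pip install datasets
    return load_dataset("hlt-lab/voicebench", name, split=split)

# data = load_subset("alpacaeval")  # rows carry the audio and the prompt
```

Each row of a loaded subset contains the recorded or synthesized audio alongside its prompt text, so it can be fed directly to a speech-capable model.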
## How to Use the Dataset

To load the dataset in your Python environment:

### Available Data

| Subset | # Samples | Audio Source | Task Type |
|-----------------|:---------:|:------------:|:---------------------:|
| alpacaeval | 199 | Google TTS | Open-Ended QA |
| alpacaeval_full | 636 | Google TTS | Open-Ended QA |
| commoneval | 200 | Human | Open-Ended QA |
| wildvoice | 1,000 | Human | Open-Ended QA |
| openbookqa | 455 | Google TTS | Multiple-Choice QA |
| mmsu | 3,074 | Google TTS | Multiple-Choice QA |
| sd-qa | 553 | Human | Reference-Based QA |
| mtbench | 46 | Google TTS | Multi-Turn QA |
| ifeval | 345 | Google TTS | Instruction Following |
| bbh | 1,000 | Human | Reasoning |
| advbench | 520 | Google TTS | Safety |

**PS**: contains and data, while is constructed with the complete data. is used in the leaderboard.

## Evaluation

### Step 1: Get the Voice Assistant's Response

To obtain the responses from the voice assistant model, run the following command:

**Supported Arguments:**

- : Specifies the model to use for generating responses. Replace with the model you want to test (e.g., , ).
- : Selects the subset of the dataset. Replace with other subsets like , , etc., depending on your evaluation needs.
- : Chooses the data split to evaluate.
  - For most datasets ( , , , ), use as the value.
  - For the subset, provide a region code instead of , such as for Australia, for the United States, etc.
- : Use for spoken instructions, for text-based instructions.

This will generate the output and save it to a file named `naive-alpacaeval-test-audio.jsonl`.

### Step 2: Automatic GPT-4 Evaluation

For datasets , , , and , we use to evaluate the responses. Run the following command to get the GPT score:

The GPT evaluation scores will be saved to .

**Note:** This step should be skipped for other datasets, as they are not evaluated using GPT-4.
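The Step 1 command itself was elided above, but part of it can be reconstructed from the stated output name `naive-alpacaeval-test-audio.jsonl`, which encodes model, data subset, split, and modality. The `main.py` entry point and the flag names in the commented line are assumptions, not confirmed by the text:

```shell
# Reconstructed from the output pattern <model>-<data>-<split>-<modality>.jsonl;
# the entry point and flag names are assumptions -- check the repository.
model=naive          # example model name, taken from the output file above
data=alpacaeval      # any subset from the Available Data table
split=test           # for the region-split subset, a region code goes here
modality=audio       # spoken instructions; a text modality is the alternative
# python main.py --model "$model" --data "$data" --split "$split" --modality "$modality"
echo "${model}-${data}-${split}-${modality}.jsonl"
```

Since the file name fully encodes the run configuration, responses from different models, subsets, and splits can coexist in one directory without ambiguity.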
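The Step 2 judging command was elided as well. The sketch below only assembles a plausible command string; the script name `api_judge.py` and the `--src_file` flag are guesses, and actually running it would require API access for the GPT judge:

```shell
# Assembles (but does not run) an assumed judging command; the script
# and flag names are guesses -- verify them against the repository.
src_file=naive-alpacaeval-test-audio.jsonl   # Step 1 output
cmd="python api_judge.py --src_file ${src_file}"
echo "$cmd"
```

Per the note above, this step applies only to the GPT-scored subsets and is skipped for the rest.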
### Step 3: Get the Final Results

To generate the final evaluation results, run:

**Supported Arguments:**

- : Specifies the evaluator type:
  - Use for , , and .
  - Use for .
  - Use for .
  - Use for .
  - Use for and .
  - Use for .

## Awesome Voice Assistants

| Title | Date | Code |
|:------|:----:|:----:|
| **OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue** | 2025-08-13 | Github |
| **DIFFA: Large Language Diffusion Models Can Listen and Understand** | 2025-07-24 | Github |
| **Voxtral** | 2025-07-17 | -- |
| **Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models** | 2025-07-10 | Github |
| **DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment** | 2025-07-03 | Github |
| **Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model** | 2025-06-16 | Github |
| **Ming-Omni: A Unified Multimodal Model for Perception and Generation** | 2025-06-11 | Github |
| **Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model** | 2025-06-10 | -- |
| **VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model** | 2025-05-06 | Github |
| **LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis** | 2025-05-05 | Github |
| **Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play** | 2025-05-05 | Github |
| **Kimi-Audio Technical Report** | 2025-04-25 | Github |
| **Qwen2.5-Omni Technical Report** | 2025-03-26 | Github |
| **Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs** | 2025-03-03 | HF |
| **Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision** | 2025-02-26 | -- |
| **M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance** | 2025-02-26 | -- |
| **Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction** | 2025-02-24 | Github |
| **LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems** | 2025-02-19 | -- |
| **FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems** | 2025-02-19 | -- |
| **Step-Audio: Unified Understanding and Generation in Intelligent Speech Int…
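Returning to Step 3 above: the scoring command and the concrete evaluator names were likewise elided. This sketch assumes an `evaluate.py` entry point with `--src_file` and `--evaluator` flags, and `open` is a hypothetical name for the open-ended-QA evaluator:

```shell
# Assembles (but does not run) an assumed scoring command; the script,
# flag, and evaluator names are guesses -- verify against the repository.
src_file=naive-alpacaeval-test-audio.jsonl   # responses (plus GPT scores)
cmd="python evaluate.py --src_file ${src_file} --evaluator open"
echo "$cmd"
```

The evaluator choice must match the dataset type, as laid out in the Step 3 Supported Arguments list.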