
LiveBench / LiveBench

LiveBench: A Challenging, Contamination-Free LLM Benchmark

View on GitHub
1,102 stars • 99 forks • 126 issues
Languages: Python, Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing LiveBench/LiveBench in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/LiveBench/LiveBench)
Preview: Analyzed by RepoMind
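The badge above follows shields.io's static-badge URL scheme, where the label, message, and color are path segments with spaces percent-encoded. A small helper for generating such Markdown snippets might look like this (the function name `badge_markdown` is my own; only the URL format comes from shields.io):

```python
from urllib.parse import quote

def badge_markdown(label, message, color, link, style="for-the-badge"):
    """Build a Markdown snippet for a shields.io static badge.

    shields.io encodes static badges as <label>-<message>-<color> in the
    URL path; literal dashes must be doubled and spaces percent-encoded.
    """
    def esc(text):
        return quote(text.replace("-", "--"), safe="")

    img = f"https://img.shields.io/badge/{esc(label)}-{esc(message)}-{color}?style={style}"
    return f"[![{label} {message}]({img})]({link})"

print(badge_markdown("Analyzed by", "RepoMind", "4F46E5",
                     "https://repomind.in/repo/LiveBench/LiveBench"))
```

Running this reproduces the exact snippet shown above, with `Analyzed by` encoded as `Analyzed%20by` in the image URL.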

Repository Overview (README excerpt)


# LiveBench

🏆 Leaderboard • 💻 Data • 📝 Paper

LiveBench appeared as a Spotlight Paper at ICLR 2025. Top models as of 30th September 2024 (for a full, up-to-date leaderboard, see here). Please see the changelog for details about each LiveBench release.

## Table of Contents

- Introduction
- Installation Quickstart
- Usage
- Data
- Adding New Questions
- Evaluating New Models and Configuring API Parameters
- Documentation
- Citation

## Introduction

Introducing LiveBench: a benchmark for LLMs designed with test-set contamination and objective evaluation in mind. LiveBench has the following properties:

- LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.

**We will evaluate your model!** Open an issue or email us at livebench@livebench.ai!

## Installation Quickstart

We recommend using a virtual environment to install LiveBench. A base install lets you generate answers with API models, conduct judgments, and show results; to score results on the coding tasks (code_completion and code_generation), you will also need to install additional required dependencies.

Note that, to evaluate the agentic coding questions, you will need Docker installed and available; this is checked before such tasks are run.

**Note about local models:** Local model inference is unmaintained. We highly recommend serving your model behind an OpenAI-compatible API using vllm and running inference against that endpoint.

Our repo contains code from LiveCodeBench and IFEval.
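The exact command LiveBench uses for its Docker pre-flight check is not shown in this excerpt; a generic availability probe, assuming `docker info` as the health check (a common choice, not necessarily LiveBench's), could be sketched as:

```python
import shutil
import subprocess

def docker_available(timeout=5):
    """Return True if a Docker CLI is on PATH and the daemon responds.

    `docker info` exits non-zero when the daemon is unreachable. The
    exact probe LiveBench runs is an assumption here.
    """
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling this before queuing agentic coding tasks fails fast instead of erroring midway through an evaluation run.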
## Usage

### Running Evaluations

The simplest way to run LiveBench inference and scoring is the provided run script, which handles the entire evaluation pipeline: generating answers, scoring them, and showing results.

Some common options let you:

- Specify which subset(s) of questions to use (e.g. all questions, or coding tasks only)
- Choose the model to evaluate
- Set the maximum number of tokens in model responses (defaults to 4096 unless overridden for specific models)
- Point at a custom API endpoint for OpenAI-compatible servers
- Name the environment variable containing the API key (defaults to OPENAI_API_KEY for OpenAI models)
- Pass a raw API key value
- Set the number of concurrent API requests (for models with high rate limits)
- Continue from a previously interrupted run
- Retry questions that failed in previous runs
- Evaluate questions from a specific LiveBench release

Run the script with its help flag to see all available options. When this is finished, follow along with Viewing Results to view results.

**Note:** The current LiveBench release is 2025-04-25; however, not all questions for this release are public on Hugging Face. In order to evaluate all categories, you will need to tell all scripts to use the most recent public questions.

**Note:** Evaluation of the agentic coding tasks requires building task-specific Docker images; storing all of these images may take up to 150 GB. Images are needed both for inference and for evaluation. In the future we will work on optimizing the evaluation process for this task to minimize storage requirements.

### Parallel Evaluation Options

LiveBench provides two different arguments for parallelizing evaluations, which can be used independently or together:

- A tmux-based mode runs separate tasks/categories in parallel by creating multiple tmux sessions. Each category or task runs in its own terminal session, allowing simultaneous evaluation across different benchmark subsets. This also parallelizes the ground-truth evaluation phase.
  By default, this will create one session for each category; if a task subset is supplied, there will be one session for each task.
- A per-task concurrency setting controls the number of questions answered concurrently within a single task evaluation instance, i.e. how many API requests are made simultaneously for a specific task.

**When to use which option:**

- **For high rate limits (e.g., commercial APIs with high throughput):** use both options together for maximum throughput when evaluating the full benchmark.
- **For lower rate limits:** when running the entire LiveBench suite, the tmux-based mode is recommended to parallelize across categories, even if per-task concurrency must be kept low. For small subsets of tasks, per-task concurrency alone may be more efficient, as the overhead of creating multiple tmux sessions provides less benefit.
- **For single-task evaluation:** when running just one or two tasks, use only the per-task concurrency setting.

Note that the tmux-based mode requires tmux to be installed on your system. The number of tmux sessions created will depend on the number of categories or tasks being evaluated.

### Viewing Results

You can view the results of your evaluations with the results script, which takes a space-separated list of model IDs to show; for example, you can show the results of gpt-4o and claude-3-5-sonnet on coding tasks only. Multiple subset values can be provided to see scores on specific subsets of the benchmark. If no model argument is provided, all models will be shown. The subset and release arguments should match what was used during evaluation. The leaderboard will be displayed in the terminal; you can also find the breakdown by category and by task in the output files.

### Error Checking

The run script will print out questions for which a model's output is an error marker, which indicates repeated API call failures. You can rerun just the failed questions with the dedicated rerun script, or run the normal pipeline with the resume and retry-failures options. By default, LiveBench will retry API calls three times and will include a delay in between attempts to account for rate limits.
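The retry-with-delay behavior described here can be sketched generically as follows; the helper name, signature, and the backoff multiplier are my assumptions, not LiveBench's actual code:

```python
import time

def call_with_retries(fn, *args, attempts=3, delay=1.0, backoff=2.0, **kwargs):
    """Call fn, retrying on any exception up to `attempts` times.

    Sleeps `delay` seconds between tries, multiplying by `backoff` each
    time to ease off a rate-limited API. Re-raises the last error if
    every attempt fails.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
                delay *= backoff
    raise last_exc
```

Wrapping each model API call this way turns transient rate-limit errors into delayed retries instead of hard failures.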
If the errors seen during evaluation are due to rate limits, you may need to switch parallelization modes or decrease the per-task concurrency value. If a…
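The per-task concurrency setting discussed above caps how many requests are in flight at once. A minimal sketch of that pattern using a thread pool (the function names are hypothetical illustrations, not LiveBench's API):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_questions(questions, ask, parallel=4):
    """Answer `questions` with at most `parallel` concurrent calls to `ask`.

    `ask` stands in for one model API request. A smaller `parallel`
    value keeps a rate-limited API happy at the cost of throughput;
    results come back in the original question order.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(ask, questions))
```

Lowering `parallel` here is the in-process analogue of decreasing the concurrency value when rate-limit errors appear.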