
EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

11,718 stars
3,097 forks
754 issues
Python · Shell · C++

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing EleutherAI/lm-evaluation-harness in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/EleutherAI/lm-evaluation-harness)
Preview: Analyzed by RepoMind

Repository Overview (README excerpt)


# Language Model Evaluation Harness

---

## Latest News 📣

- [2025/12] **CLI refactored** with subcommands and YAML config file support. See the CLI Reference and Configuration Guide.
- [2025/12] **Lighter install**: the base package no longer bundles model-backend dependencies. Install model backends (e.g. HuggingFace transformers, vLLM) separately.
- [2025/07] Added arguments for stripping chain-of-thought reasoning traces from models that emit them.
- [2025/03] Added support for steering HF models!
- [2025/02] Added SGLang support!
- [2024/09] We are prototyping support for text+image multimodal input, text output tasks, and have added new multimodal model types and a multimodal task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it, and suggest they check out `lmms-eval`, a wonderful project originally forked from lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
- [2024/07] API model support has been updated and refactored, introducing support for batched and async requests and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using vLLM's OpenAI-compliant API to host the model, and using the `local-completions` model type to evaluate it.**
- [2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the `leaderboard` task group.

---

## Announcement

**A new v0.4.0 release of lm-evaluation-harness is available!** New updates and features include:

- New Open LLM Leaderboard tasks, available under the `leaderboard` task group
- Internal refactoring
- Config-based task creation and configuration
- Easier import and sharing of externally-defined task config YAMLs
- Support for Jinja2 prompt design, easy modification of prompts, and prompt imports from Promptsource
- More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, and configurable fewshot settings
- Speedups and new modeling libraries supported, including faster data-parallel HF model usage, vLLM support, and MPS support with HuggingFace
- Logging and usability changes
- New tasks, including CoT BIG-Bench-Hard, Belebele, and user-defined task groupings

Please see our updated documentation pages for more details. Development will be continuing on the `main` branch, and we encourage you to give us feedback on desired features and how to improve the library further, or to ask questions, either in issues or PRs on GitHub, or in the EleutherAI Discord!

---

## Overview

This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

**Features:**

- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via transformers (including quantization via GPTQModel and AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with vLLM.
- Support for commercial APIs, including OpenAI and TextSynth.
- Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library.
- Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
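The config-based task creation mentioned above uses YAML files. As a rough sketch of what such a file looks like, here is a hypothetical minimal task config; the field names follow the harness's task-configuration documentation, but the task itself (`demo_sentiment`) and its dataset choice are illustrative assumptions, not a task shipped with the library:

```yaml
# Hypothetical minimal task config (illustrative only).
task: demo_sentiment
dataset_path: glue          # Hugging Face Hub dataset path (assumed for illustration)
dataset_name: sst2
output_type: multiple_choice
doc_to_text: "Sentence: {{sentence}}\nSentiment:"   # Jinja2 prompt template
doc_to_target: label
doc_to_choice: ["negative", "positive"]
metric_list:
  - metric: acc
```

Externally defined YAMLs like this can be shared and imported, per the v0.4.0 release notes above.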
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and MosaicML.

## Install

To install the package from the GitHub repository, clone it and install it in editable mode.

### Installing Model Backends

The base installation provides the core evaluation framework. **Model backends must be installed separately** using optional extras: one extra for HuggingFace transformers models, one for vLLM inference, and one for API-based models (OpenAI, Anthropic, etc.). Multiple backends can be installed together. A detailed table of all optional extras is available at the end of this document.

## Basic Usage

Documentation:

| Guide | Description |
|-------|-------------|
| CLI Reference | Command-line arguments and subcommands |
| Configuration Guide | YAML config file format and examples |
| Python API | Programmatic usage |
| Task Guide | Available tasks and task configuration |

Use the CLI's `--help` output to see available options, including the evaluation options; the CLI also provides a command to list available tasks.

### Hugging Face

> [!Important]
> To use the HuggingFace backend, first install the corresponding extra.

To evaluate a model hosted on the HuggingFace Hub (e.g. GPT-J-6B) on a task such as `hellaswag`, pass the model and task names on the command line (this assumes you are using a CUDA-compatible GPU). Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the revisions feature on the Hub to store partially trained checkpoints, or specifying the datatype for running a model.

Models that are loaded via both `AutoModelForCausalLM` (autoregressive, decoder-only GPT-style models) and `AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in HuggingFace are supported.

Batch size selection can be automated by setting the `--batch_size` flag to `auto`. This will perform automatic detection of the largest batch size that will fit on your device.
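The install flow described above can be sketched as the following shell commands. The editable-install pattern matches the README's instructions; the exact extra names (`vllm`, `api`) are taken as assumptions from the backend list and should be checked against the extras table at the end of the document:

```shell
# Install the core framework from the GitHub repository
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Model backends are optional extras, installed separately
pip install -e ".[vllm]"       # vLLM inference (extra name assumed)
pip install -e ".[api]"        # API-based models (extra name assumed)

# Multiple backends can be combined in one install
pip install -e ".[vllm,api]"
```

These commands require network access and are shown as a sketch of the install pattern, not a verified transcript.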
On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size to gain a further speedup. To do this, append `:N` to the `auto` batch size to automatically recompute the largest batch size N times; for example, `auto:4` recomputes it 4 times.

> [!Note]
> Just like with Hub model names, a local path to a model can be provided as well.

### Evaluating GGUF Models

The harness supports evaluating models in GGUF format using the Hugging Face backend. This allows you to use quantized models compatible with…
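Putting the pieces above together, an evaluation command might look like the following sketch. The model, task, checkpoint revision, and dtype values are illustrative; `--model_args`, `--batch_size auto:4`, and the `hf` model type follow the usage described above:

```shell
# Evaluate GPT-J-6B on hellaswag with a partially trained checkpoint revision,
# an explicit dtype, and automatic batch sizing recomputed 4 times.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,revision=step100000,dtype=float \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size auto:4
```

Running this requires a CUDA-compatible GPU and the HuggingFace backend extra installed; treat it as a template rather than a verified invocation.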