open-compass / VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Repository Overview (README excerpt)
A Toolkit for Evaluating Large Vision-Language Models.

English | 简体中文 | 日本語

🏆 OC Leaderboard • 🏗️ Quickstart • 📊 Datasets & Models • 🛠️ Development

🤗 HF Leaderboard • 🤗 Evaluation Records • 🤗 HF Video Leaderboard • 🔊 Discord • 📝 Report • 🎯 Goal • 🖊️ Citation

**VLMEvalKit** (the python package name is **vlmeval**) is an **open-source evaluation toolkit** for **large vision-language models (LVLMs)**. It enables **one-command evaluation** of LVLMs on various benchmarks, without the heavy workload of preparing data across multiple repositories. In VLMEvalKit, we adopt **generation-based evaluation** for all LVLMs, and provide evaluation results obtained with both **exact matching** and **LLM-based answer extraction**.

**Recent Codebase Changes**

• **[2025-09-12]** **Major Update: Improved Handling for Models with Thinking Mode.** A new feature in PR 1229 improves support for models with thinking mode: VLMEvalKit now allows the use of a custom parsing function. **We strongly recommend this for models with thinking mode to ensure the accuracy of evaluation.** To use this functionality, enable the corresponding environment variable. By default, the function parses content within the thinking tags and stores it in a dedicated key of the output. For more advanced customization, you can also define a custom function for your model; see the InternVL implementation for an example.

• **[2025-09-12]** **Major Update: Improved Handling for Long Responses (more than 16k/32k).** A new feature in PR 1229 improves support for models with long response outputs: VLMEvalKit can now save prediction files in TSV format.
**Since individual cells in an xlsx file are limited to 32,767 characters, we strongly recommend using this feature for models that generate long responses (e.g., exceeding 16k or 32k tokens) to prevent data truncation.** To use this functionality, enable the corresponding environment variable.

• **[2025-08-04]** In PR 1175, we refined the choice-extraction logic, which routes more of the evaluation to LLM choice extractors and empirically leads to a slight performance improvement on MCQ benchmarks.

**🆕 News**

• **[2025-07-07]** Supported **SeePhys**, a full-spectrum multimodal benchmark for evaluating physics reasoning across different knowledge levels. Thanks to **Quinn777** 🔥🔥🔥
• **[2025-07-02]** Supported **OvisU1**, thanks to **liyang-7** 🔥🔥🔥
• **[2025-06-16]** Supported **PhyX**, a benchmark that assesses the capacity for physics-grounded reasoning in visual scenarios. 🔥🔥🔥
• **[2025-05-24]** To enable faster evaluation of large-scale or thinking models, **VLMEvalKit supports multi-node distributed inference** using **LMDeploy** (supports *InternVL Series, QwenVL Series, LLaMa4*) or **VLLM** (supports *QwenVL Series, LLaMa4*). You can activate this feature by adding the corresponding flag to your custom model configuration in config.py. Leverage these tools to significantly speed up your evaluation workflows 🔥🔥🔥
• **[2025-05-24]** Supported Models: **InternVL3 Series, Gemini-2.5-Pro, Kimi-VL, LLaMA4, NVILA, Qwen2.5-Omni, Phi4, SmolVLM2, Grok, SAIL-VL-1.5, WeThink-Qwen2.5VL-7B, Bailingmm, VLM-R1, Taichu-VLR**. Supported Benchmarks: **HLE-Bench, MMVP, MM-AlignBench, Creation-MMBench, MM-IFEval, OmniDocBench, OCR-Reasoning, EMMA, ChaXiv, MedXpertQA, Physics, MSEarthMCQ, MicroBench, MMSci, VGRP-Bench, wildDoc, TDBench, VisuLogic, CVBench, LEGO-Puzzles, Video-MMLU, QBench-Video, MME-CoT, VLM2Bench, VMCBench, MOAT, Spatial457 Benchmark**. Please refer to **VLMEvalKit Features** for more details.
Thanks to all contributors 🔥🔥🔥
• **[2025-02-20]** Supported Models: **InternVL2.5 Series, Qwen2.5VL Series, QVQ-72B, Doubao-VL, Janus-Pro-7B, MiniCPM-o-2.6, InternVL2-MPO, LLaVA-CoT, Hunyuan-Standard-Vision, Ovis2, Valley, SAIL-VL, Ross, Long-VITA, EMU3, SmolVLM**. Supported Benchmarks: **MMMU-Pro, WeMath, 3DSRBench, LogicVista, VL-RewardBench, CC-OCR, CG-Bench, CMMMU, WorldSense**. Thanks to all contributors 🔥🔥🔥
• **[2024-12-11]** Supported **NaturalBench**, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
• **[2024-12-02]** Supported **VisOnlyQA**, a benchmark for evaluating visual perception capabilities 🔥🔥🔥
• **[2024-11-26]** Supported **Ovis1.6-Gemma2-27B**, thanks to **runninglsy** 🔥🔥🔥
• **[2024-11-25]** Added a new environment flag; by setting it, you can download the supported video benchmarks from **modelscope** 🔥🔥🔥

**🏗️ QuickStart**

See [QuickStart | 快速开始] for a quick start guide.

**📊 Datasets, Models, and Evaluation Results**

**The performance numbers on our official multi-modal leaderboards can be downloaded from here!**

**OpenVLM Leaderboard**: **Download All DETAILED Results**.

Check the **Supported Benchmarks** tab in **VLMEvalKit Features** to view all supported image & video benchmarks (70+).

Check the **Supported LMMs** tab in **VLMEvalKit Features** to view all supported LMMs, including commercial APIs, open-source models, and more (200+).

**Transformers Version Recommendation:** Note that some VLMs may not be able to run under certain transformers versions; we recommend the following settings to evaluate each VLM. [The per-model transformers version lists did not survive extraction of this excerpt.]

**Torchvision Version Recommendation:** Note that some VLMs may not be able to run…
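As a rough illustration of the thinking-mode handling described under Recent Codebase Changes, a custom parsing function can separate a model's reasoning from its final answer before answer extraction runs. The sketch below assumes a `<think>` tag and output keys `thinking`/`prediction`; the actual tag name, keys, and function signature used by VLMEvalKit may differ.

```python
import re

def split_thinking(response: str, tag: str = "think") -> dict:
    """Split a model response into reasoning content and final answer.

    Content inside <think>...</think> is collected separately so that
    downstream answer extraction only sees the final answer. Tag name
    and output keys are illustrative, not VLMEvalKit's actual ones.
    """
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    thinking = "\n".join(m.strip() for m in pattern.findall(response))
    answer = pattern.sub("", response).strip()
    return {"thinking": thinking, "prediction": answer}

out = split_thinking("<think>2+2 is 4</think>The answer is 4.")
# out["prediction"] == "The answer is 4."; out["thinking"] == "2+2 is 4"
```

Stripping the reasoning span before exact matching or LLM-based extraction avoids scoring against the chain of thought instead of the answer.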
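The long-response change above works because TSV fields, unlike xlsx cells, have no 32,767-character cap. A minimal sketch of writing prediction records to TSV with Python's standard `csv` module (the field names are illustrative, not VLMEvalKit's actual prediction-file schema):

```python
import csv

def save_predictions_tsv(records: list[dict], path: str) -> None:
    """Write prediction records to a TSV file.

    TSV fields have no hard length limit, so long chain-of-thought
    outputs are not truncated the way oversized xlsx cells would be.
    Embedded tabs/newlines are handled by the csv module's quoting.
    """
    fields = ["index", "question", "prediction"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t",
                                quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        writer.writerows(records)

# A prediction longer than the 32,767-character xlsx cell cap:
records = [{"index": 0, "question": "What is shown?",
            "prediction": "x" * 40000}]
save_predictions_tsv(records, "preds.tsv")
```

Reading the file back with `csv.DictReader(..., delimiter="\t")` recovers the full-length prediction intact.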