modelscope / evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Repository Overview (README excerpt)
> If you like this project, please click the "Star" button in the upper right corner to support us. Your support is our motivation to move forward!

## Introduction

EvalScope is a powerful and easily extensible model evaluation framework created by the ModelScope Community, aiming to provide a one-stop evaluation solution for large model developers. Whether you want to evaluate the general capabilities of a model, compare the performance of multiple models, or stress test a model service, EvalScope can meet your needs.

## Key Features

- **Comprehensive Evaluation Benchmarks**: Built-in, industry-recognized evaluation benchmarks including MMLU, C-Eval, GSM8K, and more.
- **Multi-modal and Multi-domain Support**: Evaluates many model types, including Large Language Models (LLM), Vision Language Models (VLM), Embedding, Reranker, and AIGC models.
- **Multi-backend Integration**: Seamlessly integrates multiple evaluation backends, including OpenCompass, VLMEvalKit, and RAGEval, to cover different evaluation needs.
- **Inference Performance Testing**: Powerful stress-testing tools for model services, reporting performance metrics such as TTFT (time to first token) and TPOT (time per output token).
- **Interactive Reports**: A WebUI visualization interface supporting multi-dimensional model comparison, report overviews, and detailed inspection.
- **Arena Mode**: Multi-model pairwise battles for intuitive model ranking and evaluation.
- **Highly Extensible**: Developers can easily add custom datasets, models, and evaluation metrics.

## Overall Architecture
- **Input Layer**
  - **Model Sources**: API models (OpenAI API) and local models (ModelScope)
  - **Datasets**: Standard evaluation benchmarks (MMLU, GSM8K, etc.) and custom data (MCQ/QA)
- **Core Functions**
  - **Multi-backend Evaluation**: Native backend, OpenCompass, MTEB, VLMEvalKit, RAGAS
  - **Performance Monitoring**: Supports multiple model-service APIs and data formats, tracking TTFT, TPOT, and other metrics
  - **Tool Extensions**: Integrates Tool-Bench, Needle-in-a-Haystack, and more
- **Output Layer**
  - **Structured Reports**: JSON, tables, logs
  - **Visualization Platforms**: Gradio, Wandb, SwanLab

## What's New

> [!IMPORTANT]
> **Version 1.0 Refactoring**
>
> Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.

- **[2026.03.09]** Added support for evaluation progress tracking and HTML-format visualization report generation.
- **[2026.03.02]** Added support for evaluating models via the Anthropic Claude API service.
- **[2026.02.03]** Comprehensively updated the dataset documentation, adding data statistics, data samples, usage instructions, and more. Refer to Supported Datasets.
- **[2026.01.13]** Added support for stress testing Embedding and Rerank model services. Refer to the usage documentation.
- **[2025.12.26]** Added support for Terminal-Bench-2.0, which evaluates AI agent performance on 89 real-world multi-step terminal tasks. Refer to the usage documentation.
- **[2025.12.18]** Added support for SLA auto-tuning of model API services, automatically finding the maximum concurrency a model service can sustain under given latency, TTFT, and throughput constraints. Refer to the usage documentation.
- **[2025.12.16]** Added support for audio evaluation benchmarks such as Fleurs and LibriSpeech, and for multilingual code evaluation benchmarks such as MultiPL-E and MBPP.
- **[2025.12.02]** Added support for custom multimodal VQA evaluation; refer to the usage documentation. Added support for visualizing model-service stress tests in ClearML; refer to the usage documentation.
- **[2025.11.26]** Added support for the OpenAI-MRCR, GSM8K-V, MGSM, MicroVQA, IFBench, and SciCode benchmarks.
- **[2025.11.18]** Added support for custom Function-Call (tool invocation) datasets to test whether models call tools promptly and correctly. Refer to the usage documentation.
- **[2025.11.14]** Added support for the SWE-bench_Verified, SWE-bench_Lite, and SWE-bench_Verified_mini code evaluation benchmarks. Refer to the usage documentation.
- **[2025.11.12]** Added additional metric aggregation methods; added support for multimodal evaluation benchmarks such as A_OKVQA, CMMU, ScienceQA, and V*Bench.
- **[2025.11.07]** Added support for τ²-bench, an extended and enhanced version of τ-bench that includes a series of code fixes and adds telecom-domain troubleshooting scenarios. Refer to the usage documentation.
- **[2025.10.30]** Added support for BFCL-v4, enabling evaluation of agent capabilities including web search and long-term memory. See the usage documentation.
- **[2025.10.27]** Added support for LogiQA, HaluEval, MathQA, MRI-QA, PIQA, QASC, CommonsenseQA, and other evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- **[2025.10.26]** Added support for CoNLL-2003, CrossNER, Copious, GeniaNER, HarveyNER, MIT-Movie-Trivia, MIT-Restaurant, OntoNotes5, WNUT2017, and other Named Entity Recognition evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- **[2025.10.21]** Optimized sandbox environment usage in code evaluation, supporting both local and remote operation modes. For details, refer to the documentation.
- **[2025.10.20]** Added support for evaluation benchmarks including PolyMath, SimpleVQA, MathVerse, MathVision, and AA-LCR; optimized evalscope perf perform…
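The TTFT and TPOT metrics that the stress-testing tools report can be computed from per-token arrival timestamps of a streamed response. A minimal sketch (function name and signature are illustrative, not EvalScope's actual API):

```python
def compute_ttft_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute time-to-first-token (TTFT) and time-per-output-token (TPOT).

    token_times holds the wall-clock arrival time of each streamed token.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: latency from sending the request until the first token arrives
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # TPOT: average inter-token gap across the remaining output tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

With tokens arriving at 0.5 s, 0.7 s, 0.9 s, and 1.1 s after a request at t = 0, this yields a TTFT of 0.5 s and a TPOT of 0.2 s.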
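The SLA auto-tuning feature noted in the changelog searches for the highest concurrency a service can sustain within a latency target. Assuming latency grows monotonically with concurrency, the search can be sketched as a binary search; the `measure` callback here is a stand-in for a real stress-test run, not EvalScope's implementation:

```python
from typing import Callable

def max_concurrency_under_sla(measure: Callable[[int], float], sla_latency: float,
                              lo: int = 1, hi: int = 256) -> int:
    """Largest concurrency in [lo, hi] whose measured latency meets the SLA.

    Returns 0 if even the lowest concurrency violates the SLA.
    Assumes measure() is non-decreasing in concurrency.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure(mid) <= sla_latency:  # SLA met: try a higher concurrency
            best = mid
            lo = mid + 1
        else:                            # SLA violated: back off
            hi = mid - 1
    return best

# Toy latency model: 10 ms per concurrent request; a 500 ms SLA admits 50 requests
print(max_concurrency_under_sla(lambda c: 10.0 * c, 500.0))  # 50
```

A production tuner would repeat each measurement and use a percentile (e.g. p99) latency rather than a single sample, but the search structure is the same.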
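Arena Mode, listed among the key features, ranks models through pairwise battles. As a generic illustration of how pairwise outcomes turn into a ranking (a standard Elo update, not EvalScope's actual scoring code), one battle can be scored like this:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update two Elo ratings after one pairwise battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model (400-point scale)
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A wins, so A gains and B loses 16 points at k=32
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Iterating this update over a log of battle results produces a leaderboard in which upsets against higher-rated models move ratings more than expected wins.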