# Repository Overview (README excerpt)
Welcome to **Inspect Evals**, a repository of community-contributed LLM evaluations for Inspect AI. Inspect Evals was created collaboratively by the UK AISI, Arcadia Impact, and the Vector Institute.

📚 Documentation

Community contributions are welcome and encouraged! Please see the Contributor Guide for details on submitting new evaluations. If you have general inquiries, suggestions, or expressions of interest, please contact us through the following Google Form. This repository is maintained by the Inspect Evals maintainers.

## Getting Started

The recommended Python versions for Inspect Evals are 3.11 and 3.12. You should be able to run all evals on these versions and also develop the codebase without any issues. You can install and pin a specific Python version by running:

On Python 3.13, you should be able to run all evals except one, whose dependency does not currently support 3.13+. Development should work under 3.13, but it is relatively untested; if you run into issues, let us know. As for Python 3.14, at the time of writing many packages, including a major one used by some Inspect Evals, have yet to release versions for it, so 3.14 is unsupported. If you find evals succeeding on 3.14, let us know and we'll remove this paragraph.

Below is the workflow for a typical eval. Some evaluations require additional dependencies or installation steps; if an eval needs extra dependencies, installation instructions are given in the README file in the eval's subdirectory.

## Usage

### Installation

There are two ways of using Inspect Evals: from PyPI, as a dependency of your own project, or as a standalone checked-out GitHub repository. If you are using it from PyPI, install the package and its dependencies via:

If you are using Inspect Evals from its repository, start by installing the necessary dependencies with:

### Running evaluations

Now you can start evaluating models.
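A minimal sketch of the setup-and-run workflow described above, assuming uv is used to manage the repo checkout and that the PyPI package is named `inspect-evals`; the task and model names are illustrative, not required:

```shell
# Pin a supported Python version (3.11 or 3.12) with uv:
uv python pin 3.12

# Option 1: use Inspect Evals from PyPI as a dependency of your own project:
pip install inspect-evals

# Option 2: from a checkout of the repository, install the dependencies:
uv sync

# Run an eval (example task and model), then browse the resulting logs:
uv run inspect eval inspect_evals/gpqa_diamond --model openai/gpt-4o
uv run inspect view
```

Pick one of the two installation options; the `uv run` prefix applies only to the repository workflow.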
For simplicity's sake, this section assumes you are using Inspect Evals from the standalone repo. If that's not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with the `uv run` prefix dropped.

To run multiple tasks simultaneously, use:

You can also import tasks as normal Python objects and run them from Python:

After running evaluations, you can view their logs using the command:

For VS Code, you can also download the Inspect AI extension for viewing logs.

If you don't want to specify the model each time you run an evaluation, create a configuration file in your working directory that defines the model environment variable along with your API key. For example:

Inspect supports many model providers, including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, Ollama, and more. See the Model Providers documentation for additional details.

You might also be able to use a newer version of pip (25.1+) to install the project, but this is not officially supported.

## Documentation

For details on building the documentation, see the documentation guide. For information on running tests and CI toggles, see the Technical Contribution Guide in CONTRIBUTING.md.

## Hardware recommendations

### Disk

We recommend having at least 35 GB of free disk space for Inspect Evals: the full installation takes about 10 GB, and you'll also need some space for the uv cache and datasets cache (most datasets are small, but some, such as MMIU, take 13 GB). Running some evals (e.g., CyBench, GDM capabilities evals) may require extra space beyond this because they pull Docker images. On top of the 35 GB above, we recommend at least 65 GB of extra space for running evals that have Dockerfiles in their file tree (though you might get away with less). In total, you should be comfortable running evals with 100 GB of free space.
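Such a configuration file (conventionally a `.env` file) might look like the following sketch; `INSPECT_EVAL_MODEL` is Inspect AI's default-model environment variable, and the model name and key value are placeholders:

```
INSPECT_EVAL_MODEL=openai/gpt-4o
OPENAI_API_KEY=<your-openai-api-key>
```

Use the API-key variable that matches your provider (e.g., `ANTHROPIC_API_KEY` for Anthropic models).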
If you end up running out of space while having 100+ GB of free space available, please let us know; this might be a bug.

### RAM

The amount of memory needed varies significantly by eval. You'll be able to run most evals with only 0.5 GB of free RAM. However, some evals with larger datasets require 2-3 GB or more, and some evals that use Docker (e.g., some GDM capabilities evals) require up to 32 GB of RAM.

## Harbor Framework Evaluations

For running evaluations from the Harbor Framework (e.g., Terminal-Bench 2.0, SWE-Bench Pro), use the Inspect Harbor package, which provides an interface for running Harbor tasks with Inspect AI.

## List of Evals

### Coding

- #### APPS: Automated Programming Progress Standard

  APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels: 1,000 introductory, 3,000 interview, and 1,000 competition-level problems. The dataset also includes 5,000 training samples, for a total of 10,000 samples. We evaluate on questions from the test split, which consists of programming problems commonly found in coding interviews.

  Contributed by: @camtice

- #### AgentBench: Evaluate LLMs as Agents

  A benchmark designed to evaluate LLMs as agents.

  Contributed by: @Felhof, @hannagabor, @shaheenahmedc

- #### BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

  Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.

  Contributed by: @tim-hua-01

- #### CORE-Bench

  Evaluates how well an LLM agent can computationally reproduce the results of a set of scientific papers.

  Contributed by: @enerrio

- #### ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

  Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks than on method-level tasks.
  Contributed by: @zhenningdavidliu

- #### ComputeEval: CUDA Code Generation Benchmark

  Evaluates LLM capability to generate correct CUDA code for kernel implementati…