
petergpt / bullshit-benchmark

BullshitBench, created by Peter Gostev, measures whether AI models challenge nonsensical prompts instead of confidently answering them.

View on GitHub
1,224 stars
48 forks
11 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing petergpt/bullshit-benchmark in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.
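The on-demand loading described above can be sketched roughly as follows. This is an illustrative sketch only: the class and method names are hypothetical, not RepoMind's actual API. The key difference from chunk-based RAG is that whole files enter the context lazily and intact, rather than as pre-split fragments.

```python
# Hypothetical sketch of context-augmented generation that loads whole
# source files on demand, instead of retrieving pre-chunked fragments.
# All names here are assumptions; this is not RepoMind's actual code.
from pathlib import Path


class FileContextLoader:
    def __init__(self, repo_root):
        self.repo_root = Path(repo_root)
        self.cache = {}  # files are read lazily, then cached for reuse

    def load(self, relative_path):
        """Return the full text of one source file, reading it only once."""
        if relative_path not in self.cache:
            self.cache[relative_path] = (self.repo_root / relative_path).read_text()
        return self.cache[relative_path]

    def build_context(self, paths):
        """Concatenate complete files into one prompt-context block."""
        return "\n\n".join(f"### {p}\n{self.load(p)}" for p in paths)
```

Because each file is loaded whole, the model sees complete definitions and their surrounding code, avoiding the cross-chunk fragmentation that traditional RAG retrieval can introduce.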

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/petergpt/bullshit-benchmark)

Repository Overview (README excerpt)


BullshitBench v2

BullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.

• Public viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
• Updated: 2026-03-12

Latest Changelog Entry (2026-03-12)

• Added benchmark runs for the new Grok 4.20 variants across both published datasets.
• Published the Grok 4.20 rows into both viewer tracks:
  • ( ) with questions
  • ( ) with questions
• Simplified the visible model labels in the viewers by dropping the suffix from the Grok 4.20 display names, while keeping the underlying model IDs unchanged.
• Refined the main chart row-selection treatment to make model selection easier to see without overpowering the chart.
• Updated org color mapping so renders in black and renders in green in the viewers.
• Added click-to-pin labels for scatter-chart dots in the v2 viewer so specific models can be called out on demand.
• Full details: CHANGELOG.md

v2 Changelog Highlights

• new nonsense questions in the v2 set.
• Domain-specific question coverage across domains: (40), (15), (15), (15), (15).
• New visualizations in the v2 viewer, including:
  • Detection Rate by Model (stacked mix bars)
  • Domain Landscape (overall vs domain detection mix)
  • Detection Rate Over Time
  • Do Newer Models Perform Better?
  • Does Thinking Harder Help? (tokens/cost toggle)

Viewer Walkthrough (v2)

The screenshots below follow the same flow as , starting with the main chart.

• Detection Rate by Model (Main Chart): primary leaderboard-style view showing each model's green/amber/red split.
• Domain Landscape: detection mix by domain, to compare overall performance vs each domain at a glance.
• Detection Rate Over Time: release-date trend view focused on Anthropic, OpenAI, and Google.
• Do Newer Models Perform Better? All-model scatter by release date vs. green rate.
• Does Thinking Harder Help? Reasoning scatter (tokens/cost toggle in the viewer) vs. green rate.
Benchmark Scope (v2)

• nonsense prompts total.
• domain groups: (40), (15), (15), (15), (15).
• nonsense techniques (for example: , , , ).
• -judge panel aggregation ( , , ) using panel mode + aggregation.
• Published v2 leaderboard currently includes model/reasoning rows.

What This Measures

• : the model clearly rejects the broken premise.
• : the model flags issues but still engages the bad premise.
• : the model treats the nonsense as valid.

Quick Start

• Set API keys. Provider routing is configured per model via and in config (default is OpenRouter), for example: .
• Run collection + primary judge (Claude by default):
• Run v2 end-to-end and publish into the dedicated v2 dataset:
• Optionally run the default config end-to-end (publishes to ):
• Open the viewer:
  • Published viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
  • Local viewer (optional): then open . Use the dropdown in the filters panel to switch between published datasets (for example and ).

Published Datasets

• v1 dataset remains in .
• v2 dataset is published in .
• v2 question set comes from via .
• Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (the legacy disagreement-tiebreak mode is retired from the main pipeline).
• Release notes and notable changes are tracked in .

Documentation

• Technical Guide: pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.
• Changelog: v1 to v2 release notes and publish-history highlights.
• Question Set: benchmark questions and scoring metadata.
• Question Set v2: v2 question pool generated from .
• Config: default model/pipeline settings.
• Config v2: v2-ready config (uses ).

Notes

• This README is intentionally audience-facing.
• Technical and maintainer-oriented content lives in .

License

MIT. See LICENSE.

Star History
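The scoring scheme above (green = clear rejection, amber = partial pushback, red = full acceptance, judged by a fixed three-judge panel with mean aggregation) can be sketched as follows. The 1.0/0.5/0.0 numeric mapping and the bucket thresholds are illustrative assumptions, not the repository's actual scale.

```python
# Illustrative sketch of 3-judge panel scoring with mean aggregation.
# The numeric mapping and thresholds below are assumptions, not the
# benchmark's actual implementation.

SCORES = {"green": 1.0, "amber": 0.5, "red": 0.0}


def aggregate(verdicts):
    """Mean-aggregate exactly three judge verdicts into one row score."""
    if len(verdicts) != 3:
        raise ValueError("canonical judging requires exactly 3 judges per row")
    return sum(SCORES[v] for v in verdicts) / len(verdicts)


def bucket(score):
    """Map the mean score back to a traffic-light bucket (thresholds assumed)."""
    if score >= 0.75:
        return "green"
    if score >= 0.25:
        return "amber"
    return "red"


# Two judges say the model rejected the premise, one says it hedged:
print(bucket(aggregate(["green", "green", "amber"])))  # mean 0.833 -> green
```

Fixing the panel at exactly three judges with a plain mean (rather than the retired disagreement-tiebreak mode) makes every published row directly comparable.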