BullshitBench

A benchmark measuring whether AI models challenge nonsensical prompts rather than confidently answering them, featuring 100 questions across 5 domains with a 3-tier judgment system and multi-judge panel.

BullshitBench is an open-source benchmark that evaluates the "nonsense detection" capability of LLMs. It sends carefully crafted nonsensical prompts to models (for example, prompts that reference non-existent frameworks, nest meaningless concepts, or set specificity traps) and checks whether each model identifies and rejects the false premise rather than confidently fabricating an answer.

The current version, v2, contains 100 questions across 5 domains: software (40), finance (15), law (15), medicine (15), and physics (15). The questions employ 13 different nonsense construction techniques. Each response is judged on a 3-tier scale: Clear Pushback, Partial Challenge, or Accepted Nonsense. The evaluation panel consists of 3 judge models (Claude Sonnet 4.6, GPT-5.2, and Gemini 3.1 Pro Preview), and final scores are the mean of the judges' scores.
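The mean aggregation described above can be sketched as follows. This is a minimal illustration, not the project's code: the numeric values assigned to each tier (0, 1, 2), the `TIER_SCORES` table, the `aggregate_panel` function, and the judge keys are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical numeric mapping of the three tiers.
# The source names the tiers but does not state how they are scored.
TIER_SCORES = {
    "Accepted Nonsense": 0,
    "Partial Challenge": 1,
    "Clear Pushback": 2,
}


def aggregate_panel(verdicts):
    """Return the mean tier score across all judges for one response.

    `verdicts` maps a judge name to the tier label that judge assigned.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    return mean(TIER_SCORES[label] for label in verdicts.values())


# Illustrative panel: two judges see a clear pushback, one a partial challenge.
panel = {
    "judge_a": "Clear Pushback",
    "judge_b": "Clear Pushback",
    "judge_c": "Partial Challenge",
}
score = aggregate_panel(panel)
```

Averaging keeps disagreement visible: a response that splits the panel lands between tiers instead of being forced into one by a majority vote.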

The project also supports reasoning-effort sweeps, which test the same model at low, medium, high, and xhigh reasoning settings to reveal whether "deeper thinking" improves nonsense detection. An interactive viewer provides 6 analysis views, including model detection rate rankings, domain landscapes, temporal trends, reasoning intensity correlation, and model size scatter plots, and covers 142 model/reasoning configuration rows.
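As a rough illustration of how a reasoning-effort sweep could be summarized, the sketch below groups verdicts by reasoning level and computes the share of Clear Pushback verdicts per level. The function name, the row layout, and the sample rows are hypothetical; the rows are toy values for demonstration only and are not benchmark results.

```python
from collections import defaultdict

# Reasoning levels named in the description, in increasing order.
LEVELS = ["low", "medium", "high", "xhigh"]


def pushback_rate_by_level(rows):
    """Compute the fraction of 'Clear Pushback' verdicts per reasoning level.

    `rows` is an iterable of (level, verdict) pairs. Levels with no rows
    are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [pushbacks, total]
    for level, verdict in rows:
        counts[level][1] += 1
        if verdict == "Clear Pushback":
            counts[level][0] += 1
    return {
        level: counts[level][0] / counts[level][1]
        for level in LEVELS
        if level in counts
    }


# Toy data, invented purely to exercise the function.
toy_rows = [
    ("low", "Accepted Nonsense"),
    ("low", "Clear Pushback"),
    ("high", "Clear Pushback"),
    ("high", "Clear Pushback"),
]
rates = pushback_rate_by_level(toy_rows)
```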

The pipeline has four stages: collect → grade → grade-panel → publish. Stages can be run separately, and an interrupted run can be resumed. The pipeline natively integrates the OpenRouter and OpenAI providers, with high-concurrency collection and automatic retries on rate limits. The project is MIT-licensed and maintained by Peter Gostev.
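One common way to get phased execution with resumption is to record each completed stage in a checkpoint file and skip recorded stages on the next run. The sketch below shows that pattern under stated assumptions: the `run_pipeline` function, the JSON checkpoint format, and the stand-in stage handlers are hypothetical and are not taken from the project's code.

```python
import json
import tempfile
from pathlib import Path

# Stage names as listed in the description.
STAGES = ["collect", "grade", "grade-panel", "publish"]


def run_pipeline(handlers, checkpoint_path):
    """Run stages in order, skipping any already recorded in the checkpoint.

    Returns the list of stages executed during this call.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else []
    executed = []
    for stage in STAGES:
        if stage in done:
            continue  # finished in an earlier run
        handlers[stage]()  # may raise; the checkpoint then stays unchanged
        done.append(stage)
        path.write_text(json.dumps(done))  # persist progress after each stage
        executed.append(stage)
    return executed


# Demo with stand-in handlers: 'grade-panel' fails once, then succeeds.
calls = []
state = {"fail_once": True}


def flaky_panel():
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("simulated interruption")
    calls.append("grade-panel")


handlers = {
    "collect": lambda: calls.append("collect"),
    "grade": lambda: calls.append("grade"),
    "grade-panel": flaky_panel,
    "publish": lambda: calls.append("publish"),
}

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    try:
        run_pipeline(handlers, ckpt)  # stops at the simulated failure
    except RuntimeError:
        pass
    resumed = run_pipeline(handlers, ckpt)  # picks up at 'grade-panel'
```

Writing the checkpoint after every stage means a crash loses at most the stage in progress, at the cost of one small file write per stage.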
