Benchmark All In One

The open-source hub for AI model evaluations.
Find your model among 400 models and 15 benchmarks.

Best Models

1		Gemini 3 Pro Preview (high) 70
2		GPT-5.2 (xhigh) 70
3		GPT-5 (high) 68
4		Claude Opus 4.5 (Reasoning) 67
5		Gemini 3 Flash Preview (Reasoning) 67
6		GPT-5.1 (high) 66
7		GPT-5 (medium) 65
8		GPT-5.2 (medium) 64
9		GPT-5 Codex (high) 64
10		o3 64

Trending Benchmarks

AIME 2025

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999

Reasoning
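Because every AIME answer is an integer from 000 to 999, grading can be done by exact match. The following is a minimal sketch of that idea, not the hub's actual scoring code; the function names `normalize_aime_answer` and `aime_accuracy` and the sample answers are hypothetical.

```python
import re
from typing import Optional, Sequence


def normalize_aime_answer(text: str) -> Optional[int]:
    """Parse a raw answer such as '073' or ' 73 ' into an int in [0, 999].

    Returns None when the text is not a 1-3 digit integer.
    """
    stripped = text.strip()
    if not re.fullmatch(r"[0-9]{1,3}", stripped):
        return None
    return int(stripped)


def aime_accuracy(predictions: Sequence[str], references: Sequence[int]) -> float:
    """Fraction of predictions that exactly match the reference integers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize_aime_answer(pred) == ref
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical model outputs and reference answers (not real AIME data).
sample_predictions = ["073", "12", "not sure"]
sample_references = [73, 12, 5]
sample_score = aime_accuracy(sample_predictions, sample_references)
```

Leading zeros are ignored here, so "073" and "73" count as the same answer; a stricter grader could require the three-digit form.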

Artificial Analysis Coding Index

Represents the average of coding benchmarks in the Artificial Analysis Intelligence Index (Terminal-Bench Hard, SciCode)

Coding
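As a rough illustration of the averaging described for the Coding Index, the sketch below takes the unweighted mean of the two component scores. Equal weighting is an assumption (the description only says "average"), and the function name `coding_index` and the input scores are hypothetical.

```python
from statistics import mean


def coding_index(terminal_bench_hard: float, scicode: float) -> float:
    """Unweighted mean of the two coding benchmark scores.

    Assumption: both scores are on the same 0-100 scale and weighted equally.
    """
    return mean([terminal_bench_hard, scicode])


# Hypothetical scores, not taken from any leaderboard.
example_index = coding_index(40.0, 60.0)
```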

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index v4.0 incorporates 10 evaluations: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt

Knowledge

GPQA Diamond

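The 198-question Diamond subset of GPQA (Graduate-Level Google-Proof Q&A), a set of expert-written multiple-choice questions in biology, physics, and chemistry designed to be difficult to answer even with web search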

Knowledge

HLE

Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading

Knowledge, Multi-Modal

IFBench

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements

Agent