Overview
The Artificial Analysis Intelligence Index v4.0 is a composite benchmark that scores AI models on 10 challenging evaluations. Together these cover agentic tasks, reasoning, knowledge, coding, instruction following, and scientific understanding.
Component Evaluations
This intelligence index aggregates performance from the following benchmarks:
| Category | Benchmark | Description |
|---|---|---|
| Agent | 𝜏²-Bench Telecom | Dual-control conversational agent benchmark for telecom technical support scenarios |
| Agent | Terminal-Bench Hard | AI capabilities in terminal environments |
| Coding | SciCode | Scientist-curated coding problems from laboratory settings |
| Reasoning | AA-LCR | Long Context Reasoning benchmark |
| Knowledge | AA-Omniscience | Factual knowledge and hallucination assessment across domains |
| Instruction-Following | IFBench | Precise instruction-following generalization |
| Academic | Humanity’s Last Exam | Multi-modal benchmark at the frontier of human knowledge |
| Scientific | GPQA Diamond | Graduate-level scientific Q&A |
| Scientific | CritPt | Research-level physics reasoning problems |
| General | GDPval-AA | Real-world, economically valuable work tasks across occupations, based on OpenAI's GDPval dataset |
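As a rough illustration of what aggregating these benchmarks could look like, the sketch below combines per-benchmark scores into one number. The equal-weight mean is an assumption made here for simplicity: this document does not state the weighting, and Artificial Analysis may weight benchmarks or categories differently. The function name `composite_index` and the example scores are hypothetical.

```python
from statistics import mean

# Benchmark names copied from the table above.
BENCHMARKS = [
    "𝜏²-Bench Telecom",
    "Terminal-Bench Hard",
    "SciCode",
    "AA-LCR",
    "AA-Omniscience",
    "IFBench",
    "Humanity’s Last Exam",
    "GPQA Diamond",
    "CritPt",
    "GDPval-AA",
]


def composite_index(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the 10 benchmark scores.

    Assumption: every score is on the same 0-100 scale and every
    benchmark counts equally. The real index may be weighted differently.
    """
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return mean(scores[name] for name in BENCHMARKS)


# Made-up scores, for illustration only (not real results).
example_scores = {name: 50.0 for name in BENCHMARKS}
example_scores["SciCode"] = 40.0
example_scores["GPQA Diamond"] = 80.0

index = composite_index(example_scores)
```

With these made-up inputs, `index` evaluates to 52.0. Requiring a score for every benchmark keeps the result comparable across models, since a model missing one evaluation would otherwise be averaged over a different set of tasks.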
Purpose
The composite index summarizes a model’s capability in a single score by combining its performance across the categories above. Because the score draws on diverse tasks, it favors well-rounded models over models that excel only in a narrow domain.
Source: Artificial Analysis