AGI Systems Intelligence Archive

Benchmark Situation Room

A living dossier of the strongest public numbers from frontier labs. Drop each new release into the log with a short narrative, headline metrics, and links to the source material. The tables below help compare results across model families and track directional momentum.
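
For entries that need to stay comparable over time, it helps to keep the same fields in a machine-readable form. A minimal sketch of one way to structure a log entry (the field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Release:
    """One row in the frontier release log (illustrative schema)."""
    lab: str                      # e.g. "OpenAI"
    model: str                    # e.g. "GPT-5.4"
    date: str                     # ISO date of the launch post
    narrative: str                # short summary of the release
    metrics: dict[str, float] = field(default_factory=dict)  # benchmark -> score (%)
    sources: list[str] = field(default_factory=list)         # links to announcements

entry = Release(
    lab="OpenAI", model="GPT-5.4", date="2026-03-05",
    narrative="Major gains on professional work, tool use, and ARC-style reasoning.",
    metrics={"GDPval": 83.0, "SWE-Bench Pro": 57.7, "ARC-AGI-2": 73.3},
)
```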

Frontier release log

9 tracked launches
OpenAI/Mar 5, 2026

GPT-5.4

OpenAI launches GPT-5.4 with major gains on professional work, tool use, browsing, and ARC-style reasoning benchmarks.

GPT-5.4 posts 83.0% on GDPval, 57.7% on SWE-Bench Pro, 75.1% on Terminal-Bench 2.0, 82.7% on BrowseComp, and 75.0% on OSWorld-Verified. OpenAI also reports 93.7% on ARC-AGI-1 and 73.3% on ARC-AGI-2. The ARC Prize leaderboard tracks GPT-5.4's reasoning levels separately; its GPT-5.4 Pro xHigh entry currently reaches 98.25% on the ARC-AGI-1 public eval and 92.21% on the ARC-AGI-2 public eval.

Headline metrics

  • GDPval:83.0%

  • SWE-Bench Pro:57.7%

  • ARC-AGI-2:73.3%

    OpenAI reported

  • BrowseComp:82.7%

Evaluation highlights

  • Terminal-Bench 2.0:75.1%

  • OSWorld-Verified:75.0%

  • Toolathlon:54.6%

  • ARC Prize · ARC-AGI-2 Public:92.21%

    GPT-5.4 Pro xHigh leaderboard entry

Google/Feb 19, 2026

Gemini 3.1 Pro

Google's Gemini 3.1 Pro takes the #1 spot on the Artificial Analysis Intelligence Index and leads 12 of 18 tracked benchmarks.

Gemini 3.1 Pro represents a massive leap in novel problem-solving, jumping from 31.1% to 77.1% on ARC-AGI-2. It achieves 94.3% on GPQA Diamond (best in class) and 2887 Elo on LiveCodeBench Pro, signaling Google's lead in reasoning and code generation; a sketch after this entry shows what an Elo rating implies head-to-head.

Headline metrics

  • GPQA Diamond:94.3%

    Best in class across all models

  • ARC-AGI-2:77.1%

    Up from 31.1% (Gemini 3 Pro)

  • LiveCodeBench Pro:2887 Elo

Evaluation highlights

  • GPQA Diamond:94.3%

    vs 92.4% (GPT-5.2), 91.3% (Opus 4.6)

  • ARC-AGI-2:77.1%

    2.5x improvement over Gemini 3 Pro

  • Intelligence Index:#1 (57)

    Ahead of GPT-5.3 Codex (54)
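
LiveCodeBench Pro reports ratings on an Elo scale, where rating gaps translate into head-to-head win expectancy. Assuming the standard Elo expected-score formula (the leaderboard's exact variant isn't documented here), the conversion is:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A vs B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Illustrative: Gemini 3.1 Pro's 2887 vs a hypothetical 2687-rated model.
# A 200-point Elo gap corresponds to roughly a 76% expected score.
print(f"{elo_expected_score(2887, 2687):.1%}")  # ~76.0%
```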

Anthropic/Feb 17, 2026

Claude Sonnet 4.6

Anthropic releases Sonnet 4.6 with 1M context, outperforming Opus 4.6 on some real-world tasks at the same price as Sonnet 4.5.

Claude Sonnet 4.6 delivers a 4.3x improvement on ARC-AGI-2 novel problem-solving (58.3% vs 13.6%) and 79.6% on SWE-bench Verified. It is the first Sonnet-class model with a 1M-token context window and outperforms the just-released Opus 4.6 on some office tasks, a sign the mid-tier is closing the gap; the ratio arithmetic behind the 4.3x claim is sketched after this entry.

Headline metrics

  • SWE-bench Verified:79.6%

  • OSWorld:72.5%

  • ARC-AGI-2:58.3%

    4.3x improvement from 13.6%

Evaluation highlights

  • Agentic coding · SWE-bench Verified:79.6%

    vs 80.8% (Opus 4.6)

  • ARC-AGI-2:58.3%

    4.3x jump from previous Sonnet

  • OSWorld (computer use):72.5%

    Major computer-use improvement

  • GPQA Diamond:74.1%
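
The fold-change claims in this log (2.5x for Gemini 3.1 Pro above, 4.3x here) are plain score ratios, which is easy to sanity-check:

```python
claims = {
    "Gemini 3.1 Pro ARC-AGI-2": (77.1, 31.1),     # (new score, old score)
    "Claude Sonnet 4.6 ARC-AGI-2": (58.3, 13.6),
}
for name, (new, old) in claims.items():
    print(f"{name}: {new / old:.2f}x")
# Gemini 3.1 Pro ARC-AGI-2: 2.48x   -> rounds to the quoted 2.5x
# Claude Sonnet 4.6 ARC-AGI-2: 4.29x -> rounds to the quoted 4.3x
```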

Anthropic/Feb 5, 2026

Claude Opus 4.6

Anthropic launches Opus 4.6 with 1M-token context, agent teams, and perfect AIME 2025 score.

Claude Opus 4.6 introduces multi-agent collaboration and a 1M-token context window. It achieves 80.8% on SWE-bench Verified, a perfect score on AIME 2025, and 91.3% on GPQA Diamond. The 1M context window scores 76% on MRCR v2 (8-needle variant) vs just 18.5% for the previous Opus.

Headline metrics

  • SWE-bench Verified:80.8%

  • AIME 2025:100%

  • GPQA Diamond:91.3%

  • Terminal-Bench 2.0:65.4%

Highest Claude score on agentic terminal coding

Evaluation highlights

  • Agentic coding · SWE-bench Verified:80.8%

  • Terminal-Bench 2.0:65.4%

  • AIME 2025:100%

    Perfect score

  • GPQA Diamond:91.3%

  • MMLU:91%

  • MRCR v2 (1M, 8-needle):76%

    vs 18.5% for Opus 4.5

OpenAI/Feb 5, 2026

GPT-5.3 Codex

OpenAI launches GPT-5.3 Codex, the first model to combine the Codex and GPT-5 training stacks, setting new state-of-the-art coding results.

GPT-5.3 Codex sets a new state of the art on Terminal-Bench 2.0 (77.3%), OSWorld-Verified (64.7%), and SWE-Lancer IC Diamond (81.4%). It is ~25% faster than GPT-5.2 and unifies frontier code generation, reasoning, and general-purpose intelligence in a single model.

Headline metrics

  • Terminal-Bench 2.0:77.3%

    New state of the art

  • OSWorld-Verified:64.7%

  • SWE-Lancer IC Diamond:81.4%

Evaluation highlights

  • Terminal-Bench 2.0:77.3%

    vs 65.4% (Opus 4.6)

  • OSWorld-Verified:64.7%

  • AIME 2025:94%

  • MATH:96%

Google/Dec 17, 2025

Gemini 3 Flash

Gemini 3 Flash replaces 2.5 Flash as default, outperforming 2.5 Pro at 3x the speed and a fraction of the cost.

Gemini 3 Flash achieves 78% on SWE-bench Verified, outperforming not only the entire 2.5 series but even Gemini 3 Pro on coding tasks. It is 3x faster than 2.5 Pro based on Artificial Analysis benchmarking and becomes the default model in the Gemini app globally.

Headline metrics

  • SWE-bench Verified:78%

    Beats Gemini 3 Pro

  • Intelligence Index:46

    Reasoning model tier

Evaluation highlights

  • SWE-bench Verified:78%

    Best in Gemini family

OpenAI/Dec 11, 2025

GPT-5.2

OpenAI's GPT-5.2 sets a new state of the art on SWE-Bench Pro and crosses the 90% ARC-AGI-1 threshold.

GPT-5.2 Thinking scores 80% on SWE-bench Verified and a state-of-the-art 55.6% on SWE-Bench Pro. GPT-5.2 Pro achieves 93.2% on GPQA Diamond and is the first model to cross 90% on ARC-AGI-1, at 390x lower cost than o3-preview.

Headline metrics

  • SWE-bench Verified:80%

    GPT-5.2 Thinking

  • SWE-Bench Pro:55.6%

    State of the art

  • GPQA Diamond:93.2%

    GPT-5.2 Pro

  • ARC-AGI-1:>90%

    First model to cross 90% threshold

Evaluation highlights

  • SWE-bench Verified:80%

    GPT-5.2 Thinking

  • GPQA Diamond:92.4-93.2%

    Thinking vs Pro variants

  • ARC-AGI-1:>90%

    390x cheaper than o3-preview

Mistral/Dec 2, 2025

Mistral Large 3

Mistral releases Large 3, a 675B open-weight MoE under Apache 2.0 with 256K context and multimodal understanding.

Mistral Large 3 debuts at #2 among OSS non-reasoning models on LMArena with 41B active / 675B total parameters. It delivers image understanding and best-in-class multilingual conversation, and generates output at 49.1 tokens/sec. Priced at $0.50/$1.50 per million input/output tokens, it is extremely competitive; a cost sketch follows this entry.

Headline metrics

  • LMArena OSS ranking:#2

    Non-reasoning models

  • Output speed:49.1 tok/s

Evaluation highlights

  • LMArena:#2 OSS non-reasoning

    #6 overall OSS
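
Taking the quoted pricing as the usual input/output split and assuming the 49.1 tok/s figure applies to output generation, per-request cost and latency are simple arithmetic:

```python
PRICE_IN = 0.50 / 1_000_000    # USD per input token (quoted)
PRICE_OUT = 1.50 / 1_000_000   # USD per output token (quoted)
TOKENS_PER_SEC = 49.1          # quoted output speed

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the quoted per-token prices."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Illustrative request: 10K-token prompt, 1K-token completion.
print(f"cost: ${request_cost(10_000, 1_000):.4f}")         # $0.0065
print(f"generation time: ~{1_000 / TOKENS_PER_SEC:.1f}s")  # ~20.4s
```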

Anthropic/Sep 29, 2025

Claude Sonnet 4.5

Anthropic upgrades the Claude Sonnet line with large gains on software engineering and multi-step tool use.

Claude Sonnet 4.5 posts the strongest SWE-bench Verified numbers yet while narrowing the gap with GPT-5 on reasoning-heavy tasks. Parallel test-time compute remains the lab's differentiator (one common recipe is sketched after this entry), and early customer feedback highlights strong math performance even without delegated tools.

Headline metrics

  • SWE-bench Verified:77.2%

    82.0% with parallel test-time compute

  • τ²-bench (agentic tool use):Retail 86.2% · Airline 70.0% · Telecom 98.0%

  • OSWorld (computer use):61.4%

    Up 17 pts vs Claude Opus 4.1 (44.4%)

Evaluation highlights

  • Agentic coding · SWE-bench Verified:77.2%

    vs 74.5% (Opus 4.1)

  • Terminal-Bench:50.0%

    Best-in-class agentic terminal coding

  • AIME 2025 (Python):100%

    87.0% without external tools

  • GPQA Diamond:83.4%

    +2.3 pts vs Claude 4

  • Finance Agent:55.3%

    Best public number to date
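
Anthropic does not spell out its parallel test-time compute recipe in these materials. A common shape for it is best-of-n sampling: draw several candidates in parallel and keep the one a scorer ranks highest. A minimal sketch under that assumption, where `generate` and `score` are hypothetical stand-ins for a model call and a verifier:

```python
import concurrent.futures

def best_of_n(prompt: str, generate, score, n: int = 8) -> str:
    """Sample n candidates in parallel and return the highest-scoring one.

    `generate(prompt) -> str` and `score(candidate) -> float` are
    hypothetical stand-ins for a model call and a verifier/ranker.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(prompt), range(n)))
    return max(candidates, key=score)
```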

Cross-lab benchmark ledger

10 suites tracked
SWE-bench Verified · Agentic coding
  • Claude Opus 4.6:80.8%
  • GPT-5.2 Thinking:80.0%
  • Claude Sonnet 4.6:79.6%
  • Gemini 3 Flash:78.0% · Beats Gemini 3 Pro
  • Claude Sonnet 4.5:77.2% · 82.0% with parallel compute
  • Claude Haiku 4.5:73.3%
  • GPT-5:72.8%
  • Grok 4:~72%
  • o3:71.7%
  • DeepSeek V3.2:67.8%
  • Gemini 2.5 Pro:67.2%
Terminal-Bench 2.0 · Agentic terminal coding
  • GPT-5.3 Codex:77.3% · State of the art
  • Claude Opus 4.6:65.4%
  • Claude Sonnet 4.5:50.0%
  • Claude Opus 4.1:46.5%
  • GPT-5:43.8%
  • Claude Sonnet 4:36.4%
  • Gemini 2.5 Pro:25.3%
ARC-AGI-2 · Novel problem-solving
  • GPT-5.4 Pro xHigh:92.21% · ARC Prize public eval leaderboard
  • GPT-5.4 xHigh:84.17% · ARC Prize public eval leaderboard
  • Gemini 3.1 Pro:77.1% · Up from 31.1% (Gemini 3 Pro)
  • GPT-5.4:73.3% · OpenAI reported launch eval
  • Claude Sonnet 4.6:58.3% · 4.3x improvement from 13.6%
τ²-bench · Agentic tool use
  • Claude Sonnet 4.5:Retail 86.2% · Airline 70.0% · Telecom 98.0%
  • Claude Opus 4.1:Retail 86.8% · Airline 63.0% · Telecom 71.5%
  • Claude Sonnet 4:Retail 83.8% · Airline 63.0% · Telecom 49.6%
  • GPT-5:Retail 81.1% · Airline 62.6% · Telecom 96.7%
OSWorld · Computer use
  • Claude Opus 4.6:72.7%
  • Claude Sonnet 4.6:72.5%
  • GPT-5.3 Codex:64.7%OSWorld-Verified
  • Claude Sonnet 4.5:61.4%
  • Claude Opus 4.1:44.4%
  • Claude Sonnet 4:42.2%
AIME 2025 · High school math competition
  • Claude Opus 4.6:100%
  • Claude Sonnet 4.5:100% (python) · 87.0% (no tools)
  • GPT-5:99.6% (python) · 94.6% (no tools)
  • o3:96.7% · AIME 2024
  • DeepSeek V3.2:96.0%
  • GPT-5.3 Codex:94%
  • Grok 3:93.3%
  • o4-mini:92.7% · 99.5% with Python
  • Gemini 2.5 Pro:88.0%
GPQA Diamond · Graduate-level reasoning
  • Gemini 3.1 Pro:94.3% · Best in class
  • GPT-5.2 Pro:93.2%
  • GPT-5.2 Thinking:92.4%
  • Claude Opus 4.6:91.3%
  • Grok 4:88.0%
  • o3:87.7%
  • Gemini 2.5 Pro:86.4%
  • GPT-5:85.7%
  • Grok 3:84.6%
  • Claude Sonnet 4.5:83.4%
  • DeepSeek V3.2:82.4%
  • o4-mini:81.4%
  • Claude Opus 4.1:81.0%
  • Claude Sonnet 4.6:74.1%
  • DeepSeek R1:71.5%
MMLU · Multilingual QA
  • Grok 3:92.7%
  • Llama 4 Maverick:92.4%
  • Claude Opus 4.6:91.0%
  • DeepSeek R1:90.8%
  • Claude Opus 4.1:89.5%
  • GPT-5:89.4%
  • Claude Sonnet 4.5:89.1%
  • GPT-4o:88.7%
  • DeepSeek V3:88.5%
  • Llama 4 Scout:87.2%
  • Claude Sonnet 4:86.5%
MMMU (validation) · Visual reasoning
  • GPT-5:84.2%
  • Gemini 2.5 Pro:82.0%
  • Gemini 3 Pro:81.0% · MMMU-Pro
  • Claude Sonnet 4.5:77.8%
  • Claude Opus 4.1:77.1%
  • Claude Sonnet 4:74.4%
  • Llama 4 Maverick:73.4%
Finance Agent · Financial analysis
  • Claude Sonnet 4.5:55.3%
  • Claude Opus 4.1:50.9%
  • GPT-5:46.9%
  • Claude Sonnet 4:44.5%
  • Gemini 2.5 Pro:29.4%
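
Kept in a structure like the schema sketched at the top, the ledger's per-suite leaders fall out in a few lines (scores below are a small excerpt of the tables above):

```python
ledger = {
    "SWE-bench Verified": {"Claude Opus 4.6": 80.8, "GPT-5.2 Thinking": 80.0,
                           "Claude Sonnet 4.6": 79.6, "Gemini 3 Flash": 78.0},
    "Terminal-Bench 2.0": {"GPT-5.3 Codex": 77.3, "Claude Opus 4.6": 65.4},
    "GPQA Diamond": {"Gemini 3.1 Pro": 94.3, "GPT-5.2 Pro": 93.2,
                     "Claude Opus 4.6": 91.3},
}
for suite, scores in ledger.items():
    leader, best = max(scores.items(), key=lambda kv: kv[1])
    print(f"{suite}: {leader} ({best}%)")
```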