AGI Systems Intelligence Archive

Benchmark Situation Room

A living dossier of the strongest public numbers from frontier labs. Drop each new release into the log with a short narrative, headline metrics, and links to the source material. The tables below help compare results across model families and track directional momentum.
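
For entries that need to stay comparable over time, it helps to keep the same fields in a machine-readable form. A minimal sketch of one way to structure a log entry (the field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Release:
    """One row in the frontier release log (illustrative schema)."""
    lab: str                      # e.g. "OpenAI"
    model: str                    # e.g. "GPT-5.4"
    date: str                     # ISO date of the launch post
    narrative: str                # short summary of the release
    metrics: dict[str, float] = field(default_factory=dict)  # benchmark -> score (%)
    sources: list[str] = field(default_factory=list)         # links to announcements

entry = Release(
    lab="OpenAI", model="GPT-5.4", date="2026-03-05",
    narrative="Major gains on professional work, tool use, and ARC-style reasoning.",
    metrics={"GDPval": 83.0, "SWE-Bench Pro": 57.7, "ARC-AGI-2": 73.3},
)
```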

Frontier release log

9 tracked launches
OpenAI/Mar 5, 2026

GPT-5.4

OpenAI launches GPT-5.4 with major gains on professional work, tool use, browsing, and ARC-style reasoning benchmarks.

GPT-5.4 posts 83.0% on GDPval, 57.7% on SWE-Bench Pro, 75.1% on Terminal-Bench 2.0, 82.7% on BrowseComp, and 75.0% on OSWorld-Verified. OpenAI also reports 93.7% on ARC-AGI-1 and 73.3% on ARC-AGI-2. The ARC Prize leaderboard tracks GPT-5.4's reasoning levels separately; its GPT-5.4 Pro xHigh entry currently reaches 98.25% on the ARC-AGI-1 public eval and 92.21% on the ARC-AGI-2 public eval.

Headline metrics

  • GDPval:83.0%

  • SWE-Bench Pro:57.7%

  • ARC-AGI-2:73.3%

    OpenAI reported

  • BrowseComp:82.7%

Evaluation highlights

  • Terminal-Bench 2.0:75.1%

  • OSWorld-Verified:75.0%

  • Toolathlon:54.6%

  • ARC Prize · ARC-AGI-2 Public:92.21%

    GPT-5.4 Pro xHigh leaderboard entry

Google/Feb 19, 2026

Gemini 3.1 Pro

Google's Gemini 3.1 Pro takes the #1 spot on the Artificial Analysis Intelligence Index and leads 12 of 18 tracked benchmarks.

Gemini 3.1 Pro represents a massive leap in novel problem-solving, jumping from 31.1% to 77.1% on ARC-AGI-2. It achieves 94.3% on GPQA Diamond (best in class) and 2887 Elo on LiveCodeBench Pro, signaling Google's lead in reasoning and code generation; a sketch after this entry shows what an Elo rating implies head-to-head.

Headline metrics

  • GPQA Diamond:94.3%

    Best in class across all models

  • ARC-AGI-2:77.1%

    Up from 31.1% (Gemini 3 Pro)

  • LiveCodeBench Pro:2887 Elo

Evaluation highlights

  • GPQA Diamond:94.3%

    vs 92.4% (GPT-5.2), 91.3% (Opus 4.6)

  • ARC-AGI-2:77.1%

    2.5x improvement over Gemini 3 Pro

  • Intelligence Index:#1 (57)

    Ahead of GPT-5.3 Codex (54)
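
LiveCodeBench Pro reports ratings on an Elo scale, where rating gaps translate into head-to-head win expectancy. Assuming the standard Elo expected-score formula (the leaderboard's exact variant isn't documented here), the conversion is:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A vs B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Illustrative: Gemini 3.1 Pro's 2887 vs a hypothetical 2687-rated model.
# A 200-point Elo gap corresponds to roughly a 76% expected score.
print(f"{elo_expected_score(2887, 2687):.1%}")  # ~76.0%
```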

Anthropic/Feb 17, 2026

Claude Sonnet 4.6

Anthropic releases Sonnet 4.6 with 1M context, outperforming Opus 4.6 on some real-world tasks at the same price as Sonnet 4.5.

Claude Sonnet 4.6 delivers a 4.3x improvement on ARC-AGI-2 novel problem-solving (58.3% vs 13.6%) and 79.6% on SWE-bench Verified. It is the first Sonnet-class model with a 1M-token context window and outperforms the just-released Opus 4.6 on some office tasks, a sign the mid-tier is closing the gap; the ratio arithmetic behind the 4.3x claim is sketched after this entry.

Headline metrics

  • SWE-bench Verified:79.6%

  • OSWorld:72.5%

  • ARC-AGI-2:58.3%

    4.3x improvement from 13.6%

Evaluation highlights

  • Agentic coding · SWE-bench Verified:79.6%

    vs 80.8% (Opus 4.6)

  • ARC-AGI-2:58.3%

    4.3x jump from previous Sonnet

  • OSWorld (computer use):72.5%

    Major computer-use improvement

  • GPQA Diamond:74.1%
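
The fold-change claims in this log (2.5x for Gemini 3.1 Pro above, 4.3x here) are plain score ratios, which is easy to sanity-check:

```python
claims = {
    "Gemini 3.1 Pro ARC-AGI-2": (77.1, 31.1),     # (new score, old score)
    "Claude Sonnet 4.6 ARC-AGI-2": (58.3, 13.6),
}
for name, (new, old) in claims.items():
    print(f"{name}: {new / old:.2f}x")
# Gemini 3.1 Pro ARC-AGI-2: 2.48x   -> rounds to the quoted 2.5x
# Claude Sonnet 4.6 ARC-AGI-2: 4.29x -> rounds to the quoted 4.3x
```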

Anthropic/Feb 5, 2026

Claude Opus 4.6

Anthropic launches Opus 4.6 with 1M-token context, agent teams, and perfect AIME 2025 score.

Claude Opus 4.6 introduces multi-agent collaboration and a 1M-token context window. It achieves 80.8% on SWE-bench Verified, a perfect score on AIME 2025, and 91.3% on GPQA Diamond. The 1M context window scores 76% on MRCR v2 (8-needle variant) vs just 18.5% for the previous Opus.

Headline metrics

  • SWE-bench Verified:80.8%

  • AIME 2025:100%

  • GPQA Diamond:91.3%

  • Terminal-Bench 2.0:65.4%

Highest Claude score on agentic terminal coding

Evaluation highlights

  • Agentic coding · SWE-bench Verified:80.8%

  • Terminal-Bench 2.0:65.4%

  • AIME 2025:100%

    Perfect score

  • GPQA Diamond:91.3%

  • MMLU:91%

  • MRCR v2 (1M, 8-needle):76%

    vs 18.5% for Opus 4.5

OpenAI/Feb 5, 2026

GPT-5.3 Codex

OpenAI launches GPT-5.3 Codex, the first model to combine the Codex and GPT-5 training stacks, setting new state-of-the-art coding results.

GPT-5.3 Codex sets a new state of the art on Terminal-Bench 2.0 (77.3%), OSWorld-Verified (64.7%), and SWE-Lancer IC Diamond (81.4%). It is ~25% faster than GPT-5.2 and unifies frontier code generation, reasoning, and general-purpose intelligence in a single model.

Headline metrics

  • Terminal-Bench 2.0:77.3%

    New state of the art

  • OSWorld-Verified:64.7%

  • SWE-Lancer IC Diamond:81.4%

Evaluation highlights

  • Terminal-Bench 2.0:77.3%

    vs 65.4% (Opus 4.6)

  • OSWorld-Verified:64.7%

  • AIME 2025:94%

  • MATH:96%

Google/Dec 17, 2025

Gemini 3 Flash

Gemini 3 Flash replaces 2.5 Flash as default, outperforming 2.5 Pro at 3x the speed and a fraction of the cost.

Gemini 3 Flash achieves 78% on SWE-bench Verified, outperforming not only the entire 2.5 series but even Gemini 3 Pro on coding tasks. It is 3x faster than 2.5 Pro based on Artificial Analysis benchmarking and becomes the default model in the Gemini app globally.

Headline metrics

  • SWE-bench Verified:78%

    Beats Gemini 3 Pro

  • Intelligence Index:46

    Reasoning model tier

Evaluation highlights

  • SWE-bench Verified:78%

    Best in Gemini family

OpenAI/Dec 11, 2025

GPT-5.2

OpenAI's GPT-5.2 sets a new state of the art on SWE-Bench Pro and crosses the 90% ARC-AGI-1 threshold.

GPT-5.2 Thinking scores 80% on SWE-bench Verified and a state-of-the-art 55.6% on SWE-Bench Pro. GPT-5.2 Pro achieves 93.2% on GPQA Diamond and is the first model to cross 90% on ARC-AGI-1, at 390x lower cost than o3-preview.

Headline metrics

  • SWE-bench Verified:80%

    GPT-5.2 Thinking

  • SWE-Bench Pro:55.6%

    State of the art

  • GPQA Diamond:93.2%

    GPT-5.2 Pro

  • ARC-AGI-1:>90%

    First model to cross 90% threshold

Evaluation highlights

  • SWE-bench Verified:80%

    GPT-5.2 Thinking

  • GPQA Diamond:92.4-93.2%

    Thinking vs Pro variants

  • ARC-AGI-1:>90%

    390x cheaper than o3-preview

Mistral/Dec 2, 2025

Mistral Large 3

Mistral releases Large 3, a 675B open-weight MoE under Apache 2.0 with 256K context and multimodal understanding.

Mistral Large 3 debuts at #2 among OSS non-reasoning models on LMArena with 41B active / 675B total parameters. It delivers image understanding and best-in-class multilingual conversation, and generates output at 49.1 tokens/sec. Priced at $0.50/$1.50 per million input/output tokens, it is extremely competitive; a cost sketch follows this entry.

Headline metrics

  • LMArena OSS ranking:#2

    Non-reasoning models

  • Output speed:49.1 tok/s

Evaluation highlights

  • LMArena:#2 OSS non-reasoning

    #6 overall OSS
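
Taking the quoted pricing as the usual input/output split and assuming the 49.1 tok/s figure applies to output generation, per-request cost and latency are simple arithmetic:

```python
PRICE_IN = 0.50 / 1_000_000    # USD per input token (quoted)
PRICE_OUT = 1.50 / 1_000_000   # USD per output token (quoted)
TOKENS_PER_SEC = 49.1          # quoted output speed

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the quoted per-token prices."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Illustrative request: 10K-token prompt, 1K-token completion.
print(f"cost: ${request_cost(10_000, 1_000):.4f}")         # $0.0065
print(f"generation time: ~{1_000 / TOKENS_PER_SEC:.1f}s")  # ~20.4s
```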

Anthropic/Sep 29, 2025

Claude Sonnet 4.5

Anthropic upgrades the Claude Sonnet line with large gains on software engineering and multi-step tool use.

Claude Sonnet 4.5 posts the strongest SWE-bench Verified numbers yet while narrowing the gap with GPT-5 on reasoning-heavy tasks. Parallel test-time compute remains the lab's differentiator (one common recipe is sketched after this entry), and early customer feedback highlights strong math performance even without delegated tools.

Headline metrics

  • SWE-bench Verified:77.2%

    82.0% with parallel test-time compute

  • τ²-bench (agentic tool use):Retail 86.2% · Airline 70.0% · Telecom 98.0%

  • OSWorld (computer use):61.4%

    Up 17 pts vs Claude Opus 4.1 (44.4%)

Evaluation highlights

  • Agentic coding · SWE-bench Verified:77.2%

    vs 74.5% (Opus 4.1)

  • Terminal-Bench:50.0%

    Best-in-class agentic terminal coding

  • AIME 2025 (Python):100%

    87.0% without external tools

  • GPQA Diamond:83.4%

    +2.3 pts vs Claude 4

  • Finance Agent:55.3%

    Best public number to date
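
Anthropic does not spell out its parallel test-time compute recipe in these materials. A common shape for it is best-of-n sampling: draw several candidates in parallel and keep the one a scorer ranks highest. A minimal sketch under that assumption, where `generate` and `score` are hypothetical stand-ins for a model call and a verifier:

```python
import concurrent.futures

def best_of_n(prompt: str, generate, score, n: int = 8) -> str:
    """Sample n candidates in parallel and return the highest-scoring one.

    `generate(prompt) -> str` and `score(candidate) -> float` are
    hypothetical stand-ins for a model call and a verifier/ranker.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(prompt), range(n)))
    return max(candidates, key=score)
```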

Cross-lab benchmark ledger

10 suites tracked
SWE-bench Verified · Agentic coding
  • Claude Opus 4.6:80.8%
  • GPT-5.2 Thinking:80.0%
  • Claude Sonnet 4.6:79.6%
  • Gemini 3 Flash:78.0% · Beats Gemini 3 Pro
  • Claude Sonnet 4.5:77.2% · 82.0% with parallel compute
  • Claude Haiku 4.5:73.3%
  • GPT-5:72.8%
  • Grok 4:~72%
  • o3:71.7%
  • DeepSeek V3.2:67.8%
  • Gemini 2.5 Pro:67.2%
Terminal-Bench 2.0 · Agentic terminal coding
  • GPT-5.3 Codex:77.3% · State of the art
  • Claude Opus 4.6:65.4%
  • Claude Sonnet 4.5:50.0%
  • Claude Opus 4.1:46.5%
  • GPT-5:43.8%
  • Claude Sonnet 4:36.4%
  • Gemini 2.5 Pro:25.3%
ARC-AGI-2 · Novel problem-solving
  • GPT-5.4 Pro xHigh:92.21% · ARC Prize public eval leaderboard
  • GPT-5.4 xHigh:84.17% · ARC Prize public eval leaderboard
  • Gemini 3.1 Pro:77.1% · Up from 31.1% (Gemini 3 Pro)
  • GPT-5.4:73.3% · OpenAI reported launch eval
  • Claude Sonnet 4.6:58.3% · 4.3x improvement from 13.6%
τ²-bench · Agentic tool use
  • Claude Sonnet 4.5:Retail 86.2% · Airline 70.0% · Telecom 98.0%
  • Claude Opus 4.1:Retail 86.8% · Airline 63.0% · Telecom 71.5%
  • Claude Sonnet 4:Retail 83.8% · Airline 63.0% · Telecom 49.6%
  • GPT-5:Retail 81.1% · Airline 62.6% · Telecom 96.7%
OSWorld · Computer use
  • Claude Opus 4.6:72.7%
  • Claude Sonnet 4.6:72.5%
  • GPT-5.3 Codex:64.7%OSWorld-Verified
  • Claude Sonnet 4.5:61.4%
  • Claude Opus 4.1:44.4%
  • Claude Sonnet 4:42.2%
AIME 2025 · High school math competition
  • Claude Opus 4.6:100%
  • Claude Sonnet 4.5:100% (python) · 87.0% (no tools)
  • GPT-5:99.6% (python) · 94.6% (no tools)
  • o3:96.7% · AIME 2024
  • DeepSeek V3.2:96.0%
  • GPT-5.3 Codex:94%
  • Grok 3:93.3%
  • o4-mini:92.7% · 99.5% with Python
  • Gemini 2.5 Pro:88.0%
GPQA Diamond · Graduate-level reasoning
  • Gemini 3.1 Pro:94.3% · Best in class
  • GPT-5.2 Pro:93.2%
  • GPT-5.2 Thinking:92.4%
  • Claude Opus 4.6:91.3%
  • Grok 4:88.0%
  • o3:87.7%
  • Gemini 2.5 Pro:86.4%
  • GPT-5:85.7%
  • Grok 3:84.6%
  • Claude Sonnet 4.5:83.4%
  • DeepSeek V3.2:82.4%
  • o4-mini:81.4%
  • Claude Opus 4.1:81.0%
  • Claude Sonnet 4.6:74.1%
  • DeepSeek R1:71.5%
MMLU · Multilingual QA
  • Grok 3:92.7%
  • Llama 4 Maverick:92.4%
  • Claude Opus 4.6:91.0%
  • DeepSeek R1:90.8%
  • Claude Opus 4.1:89.5%
  • GPT-5:89.4%
  • Claude Sonnet 4.5:89.1%
  • GPT-4o:88.7%
  • DeepSeek V3:88.5%
  • Llama 4 Scout:87.2%
  • Claude Sonnet 4:86.5%
MMMU (validation) · Visual reasoning
  • GPT-5:84.2%
  • Gemini 2.5 Pro:82.0%
  • Gemini 3 Pro:81.0% · MMMU-Pro
  • Claude Sonnet 4.5:77.8%
  • Claude Opus 4.1:77.1%
  • Claude Sonnet 4:74.4%
  • Llama 4 Maverick:73.4%
Finance Agent · Financial analysis
  • Claude Sonnet 4.5:55.3%
  • Claude Opus 4.1:50.9%
  • GPT-5:46.9%
  • Claude Sonnet 4:44.5%
  • Gemini 2.5 Pro:29.4%
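
Kept in a structure like the schema sketched at the top, the ledger's per-suite leaders fall out in a few lines (scores below are a small excerpt of the tables above):

```python
ledger = {
    "SWE-bench Verified": {"Claude Opus 4.6": 80.8, "GPT-5.2 Thinking": 80.0,
                           "Claude Sonnet 4.6": 79.6, "Gemini 3 Flash": 78.0},
    "Terminal-Bench 2.0": {"GPT-5.3 Codex": 77.3, "Claude Opus 4.6": 65.4},
    "GPQA Diamond": {"Gemini 3.1 Pro": 94.3, "GPT-5.2 Pro": 93.2,
                     "Claude Opus 4.6": 91.3},
}
for suite, scores in ledger.items():
    leader, best = max(scores.items(), key=lambda kv: kv[1])
    print(f"{suite}: {leader} ({best}%)")
```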