Weapons Assessment Division
Model Comparison Matrix
A comparison of frontier AI models across capability benchmarks (coding, math, reasoning, multimodal) and practical deployment metrics (throughput, pricing, context window, latency). A dash (—) marks a metric with no published figure for that model.
| Model | Org | SWE-bench | Terminal | AIME | GPQA | MMLU | MMMU | Finance | ELO | Speed | Input $/M | Output $/M | Context | Latency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 80.8% | 65.4% | 100.0% | 91.3% | 91.0% | 77.0% | — | 1,503 | 46 t/s | $5 | $25 | 1M | 1900ms |
| Gemini 3.1 Pro | Google | 80.6% | 68.5% | 91.2% | 94.3% | 92.6% | 80.5% | — | 1,501 | 84 t/s | $2 | $12 | 1M | 46400ms |
| Grok 4 | xAI | 72.0% | — | 94.0% | 88.0% | — | 76.5% | — | 1,492 | 45 t/s | $3 | $15 | 260K | 15570ms |
| Gemini 3 Pro | Google | 76.2% | — | 95.0% | 91.9% | 91.8% | 81.0% | — | 1,486 | 117 t/s | $2 | $12 | 1M | 20250ms |
| Gemini 3 Flash | Google | 78.0% | — | — | 90.4% | 91.8% | 81.2% | — | 1,470 | 180 t/s | $0.5 | $3 | 1M | 8840ms |
| OpenAI o3 | OpenAI | 71.7% | — | 96.7% | 87.7% | 92.9% | 82.9% | — | 1,432 | 65 t/s | $2 | $8 | 200K | 10880ms |
| Grok 4.1 Fast | xAI | — | — | — | — | — | — | — | 1,430 | 126 t/s | $0.2 | $0.5 | 2M | 620ms |
| DeepSeek V3.2 | DeepSeek | 67.8% | 39.6% | 96.0% | 82.4% | 88.5% | — | — | 1,421 | 39 t/s | $0.28 | $0.42 | 128K | 1370ms |
| DeepSeek R1-0528 | DeepSeek | 57.6% | — | 87.5% | 81.0% | 93.4% | — | — | 1,419 | 282 t/s | $1.35 | $5.4 | 130K | 570ms |
| Mistral Large 3 | Mistral | — | — | 53.3% | 43.9% | 85.5% | — | — | 1,415 | 43 t/s | $0.5 | $1.5 | 256K | 1050ms |
| Grok 3 | xAI | — | — | 93.3% | 84.6% | 92.7% | — | — | 1,411 | 69 t/s | $3 | $15 | 1M | 750ms |
| OpenAI o4-mini | OpenAI | 68.1% | — | 92.7% | 81.4% | 90.0% | 81.6% | — | 1,391 | 114 t/s | $1.1 | $4.4 | 200K | 64850ms |
| Grok 3 Mini | xAI | — | — | 95.8% | — | — | — | — | 1,363 | 183 t/s | $0.3 | $0.5 | 131K | 720ms |
| GPT-5 | OpenAI | 72.8% | 43.8% | 94.6% | 85.7% | 89.4% | 84.2% | 46.9% | 1,350 | — | $1.25 | $10 | 400K | — |
| Llama 4 Maverick | Meta | — | — | — | 69.8% | 92.4% | 73.4% | — | 1,327 | 126 t/s | $0.31 | $0.85 | 1M | 810ms |
| Llama 4 Scout | Meta | — | — | — | 57.2% | 87.2% | 69.4% | — | 1,322 | 149 t/s | $0.18 | $0.66 | 10M | 780ms |
| Claude Sonnet 4.5 | Anthropic | 77.2% | 50.0% | 87.0% | 83.4% | 89.1% | 77.8% | 55.3% | 1,320 | 80 t/s | $3 | $15 | 200K | — |
| Claude Sonnet 4.6 | Anthropic | 79.6% | — | 83.0% | 74.1% | — | 74.2% | — | — | 48 t/s | $3 | $15 | 1M | 730ms |
| Claude Haiku 4.5 | Anthropic | 73.3% | 40.0% | 96.3% | — | — | — | — | — | 105 t/s | $1 | $5 | 200K | 620ms |
| Claude Opus 4.1 | Anthropic | 74.5% | 46.5% | 78.0% | 81.0% | 89.5% | 77.1% | 50.9% | — | — | $15 | $75 | 200K | — |
| Claude Sonnet 4 | Anthropic | 72.7% | 36.4% | 70.5% | 76.1% | 86.5% | 74.4% | 44.5% | — | — | $3 | $15 | 200K | — |
| GPT-5.4 | OpenAI | 57.7% | 75.1% | — | 92.8% | — | 81.2% | 56.0% | — | — | $2.5 | $15 | 272K | — |
| GPT-5.3 Codex | OpenAI | 80.0% | 77.3% | 94.0% | 73.8% | — | 84.0% | — | — | 73 t/s | $1.75 | $14 | 400K | 125010ms |
| GPT-5.3 Codex Spark | OpenAI | 80.0% | 77.3% | 94.0% | — | — | — | — | — | 1000 t/s | — | — | 128K | — |
| GPT-5.3 Instant | OpenAI | — | — | — | — | — | — | — | — | — | — | — | 256K | — |
| GPT-5.2 | OpenAI | 80.0% | 47.6% | 100.0% | 92.4% | 91.0% | 86.7% | — | — | 71 t/s | $1.75 | $14 | 400K | 99790ms |
| GPT-4o | OpenAI | — | — | — | 53.6% | 88.7% | 69.1% | — | — | 100 t/s | $2.5 | $10 | 128K | — |
| Gemini 2.5 Pro | Google | 67.2% | 25.3% | 88.0% | 86.4% | — | 82.0% | 29.4% | — | — | $1.25 | $10 | 1M | — |
| Gemini 2.5 Flash | Google | — | — | — | — | — | — | — | — | 350 t/s | $0.3 | $2.5 | 1M | — |
| DeepSeek V3 | DeepSeek | — | — | 39.2% | 59.1% | 88.5% | — | — | — | 60 t/s | $0.27 | $1.1 | 64K | — |
| DeepSeek R1 | DeepSeek | — | — | 79.8% | 71.5% | 90.8% | — | — | — | — | $0.55 | $2.19 | 64K | — |
| Mistral Large 2 | Mistral | — | — | — | — | 84.0% | — | — | — | — | $2 | $6 | 128K | — |
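The per-million-token input and output prices above are most useful combined into a single blended cost. A minimal sketch: the 75/25 input/output token split is an assumption for illustration (real workloads vary widely), and only a few rows from the matrix are copied in.

```python
def blended_price(in_per_m: float, out_per_m: float, in_ratio: float = 0.75) -> float:
    """Weighted $/M tokens, assuming `in_ratio` of traffic is input tokens."""
    return in_ratio * in_per_m + (1 - in_ratio) * out_per_m

# (model, input $/M, output $/M) -- sample rows from the matrix above
rows = [
    ("Claude Opus 4.6", 5.00, 25.00),
    ("Gemini 3.1 Pro", 2.00, 12.00),
    ("DeepSeek V3.2", 0.28, 0.42),
]

# Rank cheapest-first by blended cost
for model, cin, cout in sorted(rows, key=lambda r: blended_price(r[1], r[2])):
    print(f"{model:<18} ${blended_price(cin, cout):.3f}/M blended")
```

Adjust `in_ratio` toward 1.0 for retrieval-heavy workloads (large prompts, short answers) or toward 0.5 for generation-heavy ones; the ranking between models can flip when output prices differ by 5x, as they do here.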