Claude Opus 4.7 vs GPT-5.5
GPT-5.5 launched April 23, 2026. Here is how it stacks up against Claude Opus 4.7 on coding, agents, long context, and cost, plus which one to actually use.
GPT-5.5 shipped today, April 23, 2026. It is now OpenAI's most capable production model and the first real head-to-head competitor for Claude Opus 4.7, which launched a week ago. Both models sit at the frontier. Both cost $5 per million input tokens. And both claim the top spot depending on which benchmark you look at.
This post uses OpenAI's official system card, third-party testing from MindStudio and Scale AI, and real routing decisions to answer one question: which model do you reach for, and when?
Quick Answer: Which Model Wins by Task
If you need the short version before the details:
| Task | Best Model | Margin |
|---|---|---|
| Real-world PR resolution and refactors | Claude Opus 4.7 | 64.3% vs 58.6% on SWE-Bench Pro |
| Command-line agents and terminal work | GPT-5.5 | 82.7% vs 69.4% on Terminal-Bench 2.0 |
| Multi-step tool orchestration (MCP) | Claude Opus 4.7 | 79.1% vs 75.3% on MCP Atlas |
| Web research and browsing | GPT-5.5 Pro | 90.1% vs 79.3% on BrowseComp |
| Long context at 1M tokens | GPT-5.5 | 74.0% vs 32.2% on MRCR v2 8-needle |
| Finance work | Claude Opus 4.7 | 64.4% vs 60.0% on FinanceAgent v1.1 |
| Frontier math (hard tier) | GPT-5.5 | 35.4% vs 22.9% on FrontierMath Tier 4 |
| Abstract reasoning | GPT-5.5 | 85.0% vs 75.8% on ARC-AGI-2 |
No single model wins everything. The task determines the pick.
What GPT-5.5 Actually Is
GPT-5.5 is a new frontier model from OpenAI, not a minor revision of GPT-5.4. OpenAI co-designed it with NVIDIA GB200 and GB300 NVL72 systems. It matches GPT-5.4's per-token latency at higher intelligence, and uses significantly fewer tokens to complete the same Codex tasks.
Key specs:
| Spec | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Context window (API) | 1M tokens | 1M tokens |
| Context window (Codex) | 400K tokens | N/A |
| API input price | $5 per 1M tokens | $5 per 1M tokens |
| API output price | $30 per 1M tokens | $25 per 1M tokens |
| Pro/xhigh variant | $30/$180 per 1M tokens | No extra cost |
| API status | Not yet GA (ChatGPT + Codex live) | GA on API, Bedrock, Vertex, Foundry |
One number to note on pricing: Claude Opus 4.7 is 17% cheaper on output at $25 per million tokens versus GPT-5.5's $30. On output-heavy workloads (long code generation, multi-turn agent runs, document drafting) that gap compounds fast.
GPT-5.5 Pro at $30/$180 is a separate pricing tier aimed at the hardest research and regulated-domain work. That is 6x the standard output rate.
Coding: Who Wins Depends on the Task Type
This is where the split is clearest.
SWE-Bench Pro measures resolution of real GitHub issues: the kind of bug reports and feature requests developers submit in production repos. Claude Opus 4.7 scores 64.3%. GPT-5.5 scores 58.6%. Gemini 3.1 Pro sits at 54.2%. On PR resolution work (reading a broken codebase, locating the root cause, writing a fix that passes tests) Opus 4.7 leads.
Terminal-Bench 2.0 measures command-line agent tasks: long-running shell scripts, multi-step CLI workflows, automated infrastructure work. GPT-5.5 scores 82.7%. Claude Opus 4.7 scores 69.4%. That is a 13-point gap. For terminal-heavy agent pipelines, GPT-5.5 is the better call.
One important caveat: OpenAI ran Terminal-Bench with a Codex CLI harness. Anthropic used the Terminus-2 scaffold. The evaluation environments differ, so the 13-point gap is directional, not precise.
Expert-SWE is an internal OpenAI evaluation on a harder class of software engineering problems. GPT-5.5 scores 73.1%. No comparable Opus 4.7 figure exists; Anthropic did not publish one.
MindStudio's live test (run April 21, before GPT-5.5 launched) put Claude Opus 4.7 against GPT-5.4 on a 465-file TypeScript migration. Opus 4.7 produced a 5.8% correction rate; GPT-5.4 hit 13.1%. Opus 4.7 raised 14 ambiguity flags that prevented downstream errors; GPT-5.4 raised 3. GPT-5.4 finished faster. That test covers GPT-5.4, not GPT-5.5. GPT-5.5 is meaningfully improved. But the pattern it shows (Claude flags more, catches more, runs slower) likely carries forward.
The practical split for coding:
Use Opus 4.7 for PR resolution, refactors, large messy codebases, and MCP-heavy tool chains. Use GPT-5.5 for terminal-heavy pipelines, new feature implementation in Codex, and well-scoped implementation tasks with clean specs.
Agents: Long-Horizon Coherence vs Terminal Performance
Both models are built for agentic work. They are not equally good at the same kind of agents.
MCP Atlas is the benchmark for tool orchestration at scale: multi-turn agents calling many tools in sequence, handling unexpected results, maintaining state. Claude Opus 4.7 scores 79.1%. GPT-5.5 scores 75.3%. Gemini 3.1 Pro sits at 78.2%. For MCP-native workflows where the agent is calling external services, reading files, querying APIs, and synthesizing across tools, Opus 4.7 holds the edge.
Terminal-Bench 2.0 (already covered above): GPT-5.5 leads by 13 points on command-line agentic work.
Toolathlon is a multi-modal tool-use eval. GPT-5.5 scores 55.6%. No comparable Opus 4.7 figure was published.
Tau2-bench Telecom (customer service agent tasks): GPT-5.5 scores 98.0%. That number comes with a footnote: Tau2-bench was run for GPT-5.5 without prompt tuning, while other labs' entries were evaluated with prompt adjustments. The comparison is unreliable without matching methodology.
OSWorld-Verified (desktop computer use, clicking through real UIs): GPT-5.5 scores 78.7%, Opus 4.7 scores 78.0%. Effectively tied.
For agent pipelines in Claude Code and Claude's API, Opus 4.7's day-one availability across Bedrock, Vertex AI, Anthropic Foundry, and the Claude API is an operational advantage. GPT-5.5's API is rolling out "very soon." It is not live yet.
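For a sense of what MCP Atlas is stressing, the underlying pattern is the tool-call loop: the model requests a tool, the harness runs it, the result goes back into the conversation, and the loop repeats until the model produces a final answer. Here is a minimal sketch with the Anthropic Python SDK, a hypothetical `claude-opus-4-7` model ID, and a toy tool; this is an illustration of the pattern, not any benchmark's harness.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One toy tool; MCP Atlas-style tasks chain many of these across many turns.
tools = [{
    "name": "get_invoice_total",
    "description": "Return the total amount in USD for an invoice ID.",
    "input_schema": {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
}]

def get_invoice_total(invoice_id: str) -> str:
    return "1,248.00"  # stand-in for a real lookup

messages = [{"role": "user", "content": "What is the total on invoice INV-1042?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-7",  # hypothetical ID, for illustration only
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model answered without requesting another tool
    # Echo the assistant turn, then append one result per tool call.
    messages.append({"role": "assistant", "content": response.content})
    for block in response.content:
        if block.type == "tool_use":
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": get_invoice_total(**block.input),
                }],
            })

final_text = "".join(block.text for block in response.content if block.type == "text")
print(final_text)
```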
Long Context: GPT-5.5 Pulls Ahead at Scale
Both models have a 1M token context window. How well they actually use that window is a different question.
OpenAI published MRCR v2 8-needle scores: a retrieval benchmark that hides 8 facts in a long document and asks the model to find all of them. The results show a widening gap as context grows:
| Window Range | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| 4K–8K | 98.1% | 98.3% |
| 32K–64K | 90.0% | 87.1% |
| 128K–256K | 87.5% | 59.2% |
| 512K–1M | 74.0% | 32.2% |
At short context, they are equal. Past 128K, GPT-5.5 holds accuracy while Opus 4.7 drops sharply. At the full 1M window, GPT-5.5 retrieves at 74.0% accuracy. Opus 4.7 retrieves at 32.2%.
One caveat: the Opus long-context figures in OpenAI's Graphwalks table are actually Opus 4.6 numbers, labeled as such. Anthropic has not independently published Opus 4.7 long-context retrieval scores, so the MRCR v2 figures are the more reliable basis for this comparison.
For workloads that actually use a large fraction of a 1M token window (analyzing an entire monorepo, reading a year of legal filings, processing a large corpus of customer data) GPT-5.5 is the more reliable model at that scale.
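If you would rather sanity-check long-context retrieval on your own data than trust a vendor table, an MRCR-style needle probe is simple to script. Here is a minimal sketch, assuming the OpenAI Python SDK and a hypothetical `gpt-5.5` model ID (the API is not yet GA, so treat the model string as a placeholder you can swap for any long-context chat model).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Plant eight verifiable facts ("needles") in a long run of filler text.
needles = [f"The access code for vault {i} is {1000 + 7 * i}." for i in range(8)]
filler = "The quarterly report was filed on schedule and nothing unusual happened. " * 15_000

chunk = len(filler) // (len(needles) + 1)
parts = []
for i, needle in enumerate(needles):
    parts.append(filler[i * chunk:(i + 1) * chunk])
    parts.append("\n" + needle + "\n")
parts.append(filler[len(needles) * chunk:])
document = "".join(parts)  # a haystack on the order of a few hundred thousand tokens

response = client.chat.completions.create(
    model="gpt-5.5",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": document + "\n\nList the access code for every vault mentioned above.",
    }],
)

answer = response.choices[0].message.content
recovered = sum(str(1000 + 7 * i) in answer for i in range(8))
print(f"Recovered {recovered}/8 needles")
```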
Professional and Research Tasks
FinanceAgent v1.1 runs autonomous multi-step financial analysis tasks. Claude Opus 4.7 scores 64.4%. GPT-5.5 scores 60.0%. For financial agent work, Opus 4.7 leads.
GDPval measures performance across 44 professional occupations: a broad proxy for knowledge work. GPT-5.5 scores 84.9%. Opus 4.7 scores 80.3%. GPT-5.5 leads here.
OfficeQA Pro covers document-heavy office workflows. GPT-5.5 scores 54.1%. Opus 4.7 scores 43.6%. GPT-5.5 leads by 10 points.
Humanity's Last Exam covers extremely hard academic questions that require graduate-level reasoning. Without tools: Opus 4.7 at 46.9%, GPT-5.5 at 41.4%. With tools: Opus 4.7 at 54.7%, GPT-5.5 at 52.2%. Opus 4.7 leads on deep academic reasoning.
FrontierMath covers competition-level mathematics. Tier 4 is the hardest class. GPT-5.5 scores 35.4% on Tier 4 versus Opus 4.7's 22.9%. A 12.5-point gap. For hard quantitative work, GPT-5.5 wins.
ARC-AGI-2 is abstract reasoning on novel visual patterns. GPT-5.5 scores 85.0%. Opus 4.7 scores 75.8%. A clear 9-point gap. GPT-5.5 is meaningfully stronger on pattern generalization.
Cost Per Workload
Input pricing is identical: $5 per million tokens for both. The output price differs.
Daily coding session (200K tokens total, 60% output):
| Model | Cost per session |
|---|---|
| Claude Opus 4.7 | $3.40 |
| GPT-5.5 | $4.00 |
Long agent run (500K tokens, 70% output):
| Model | Cost |
|---|---|
| Claude Opus 4.7 | $9.50 |
| GPT-5.5 | $11.25 |
High-volume automation (10M tokens per month, 70% output):
| Model | Monthly cost |
|---|---|
| Claude Opus 4.7 | $190 |
| GPT-5.5 | $225 |
At scale, Opus 4.7's cheaper output pricing saves real money. That 17% output difference is not a rounding error on large pipelines.
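To make the arithmetic easy to reproduce or adapt to your own traffic, here is a small Python sketch that derives the workload figures above from the published per-million-token rates. Only the prices and token splits come from this comparison; the workload labels are illustrative.

```python
# USD per million tokens, from the pricing table above.
PRICES = {
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
}

# (total tokens, output share) for each workload.
WORKLOADS = {
    "Daily coding session": (200_000, 0.60),
    "Long agent run": (500_000, 0.70),
    "High-volume automation (monthly)": (10_000_000, 0.70),
}

def workload_cost(total_tokens: int, output_share: float, rates: dict) -> float:
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

for name, (tokens, share) in WORKLOADS.items():
    for model, rates in PRICES.items():
        print(f"{name:34s} {model:16s} ${workload_cost(tokens, share, rates):,.2f}")
```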
GPT-5.5 Pro at $30/$180 is in a different category. It targets regulated-domain use cases (investment banking, legal review, high-stakes research) where the cost per call is small relative to the value of the output.
The Data Reliability Problem
Most of the numbers in this post come from OpenAI's own system card. That means OpenAI ran the benchmarks for every model, including Opus 4.7, on its own harnesses.
A few specific reliability issues:
- Harness differences. Terminal-Bench was run by OpenAI with a Codex CLI scaffold and by Anthropic with Terminus-2. The 13-point gap may narrow or widen on a matched harness.
- Long-context Opus figures. OpenAI's Graphwalks tables use Opus 4.6 data for some cells, labeled as such. Anthropic has not independently published Opus 4.7 long-context numbers.
- Expert-SWE. An internal OpenAI benchmark with no external replication possible.
- Tau2-bench methodology mismatch. GPT-5.5 was tested without prompt tuning; other models were not. The 98.0% figure is not comparable on equal footing.
- GPT-5.5 Pro scores. Several benchmarks list a "Pro" variant figure alongside the standard GPT-5.5 number. The Pro variant costs 6x more, so comparing Pro against standard Opus 4.7 is an apples-to-oranges cost comparison.
Independent third-party benchmarks (HELM, LMSYS, Artificial Analysis) had not indexed GPT-5.5 as of today. These numbers will change as external evaluations come in.
How to Route Between the Two Models
Four clean decision rules, with a code sketch after the list:
- SWE-Bench-style PR work, MCP tool chains, finance agents, and academic reasoning: Opus 4.7. It holds better accuracy on real-world codebase tasks and leads on tool orchestration at scale. The 17% cheaper output rate makes it the default for long runs.
- Terminal-heavy agents, Codex workflows, frontier math, ARC-AGI-style reasoning, and contexts above 128K tokens: GPT-5.5. The Terminal-Bench lead is large, and long-context accuracy at 1M tokens is not close.
- Web research and synthesis: GPT-5.5 Pro if accuracy matters. BrowseComp at 90.1% for Pro versus 79.3% for Opus 4.7 is a real gap for retrieval-heavy workflows.
- Budget-sensitive, output-heavy pipelines: Opus 4.7. The $5 difference per million output tokens adds up on large-scale automation.
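As code, the four rules collapse to a small routing function. This is a sketch with hypothetical model IDs and an illustrative task taxonomy; the threshold mirrors the benchmark splits above, not an official mapping.

```python
from typing import Literal

# Illustrative task labels, not an official taxonomy.
Task = Literal[
    "pr_resolution", "mcp_agent", "finance_agent", "academic_reasoning",
    "terminal_agent", "codex_feature", "frontier_math", "abstract_reasoning",
    "web_research", "bulk_generation",
]

OPUS_TASKS = {"pr_resolution", "mcp_agent", "finance_agent", "academic_reasoning", "bulk_generation"}

def route(task: Task, context_tokens: int = 0) -> str:
    """Pick a model ID (hypothetical strings) using the four decision rules."""
    if context_tokens > 128_000:
        return "gpt-5.5"            # long-context retrieval gap widens past 128K
    if task == "web_research":
        return "gpt-5.5-pro"        # BrowseComp lead, at 6x the output price
    if task in OPUS_TASKS:
        return "claude-opus-4-7"    # real-codebase accuracy plus cheaper output
    return "gpt-5.5"                # terminal agents, frontier math, ARC-style reasoning

print(route("pr_resolution"))        # claude-opus-4-7
print(route("mcp_agent", 400_000))   # gpt-5.5: context size overrides the task rule
```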
Opus 4.7 is GA today on the Anthropic API, Bedrock, Vertex AI, and Foundry. GPT-5.5's API is still rolling out. If you need to ship something now, Opus 4.7 is live everywhere; GPT-5.5 will catch up shortly.
FAQ
Is Claude Opus 4.7 better than GPT-5.5?
It depends entirely on the task. Opus 4.7 leads on SWE-Bench Pro (64.3% vs 58.6%), MCP Atlas tool orchestration (79.1% vs 75.3%), FinanceAgent (64.4% vs 60.0%), and Humanity's Last Exam. GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs 69.4%), FrontierMath Tier 4, ARC-AGI-2 (85.0% vs 75.8%), and long-context retrieval above 128K tokens. For real-world PR resolution and MCP agents, Opus 4.7 wins. For terminal agents and research at scale, GPT-5.5 wins.
What does GPT-5.5 cost?
The standard API costs $5 per million input tokens and $30 per million output tokens. GPT-5.5 Pro costs $30 input and $180 output per million tokens. Batch and Flex pricing run at half the standard rate. The API is not yet generally available as of April 23, 2026. It is rolling out to the Responses and Chat Completions endpoints. ChatGPT and Codex access are live now for Plus, Pro, Business, and Enterprise plans.
Which model is better for agentic coding tasks?
Both are strong. Claude Opus 4.7 holds the edge on SWE-Bench-style PR resolution, MCP tool orchestration, and coherent multi-step reasoning with tools. GPT-5.5 leads on Terminal-Bench command-line tasks and new feature implementation in Codex, and uses fewer tokens to complete Codex tasks than GPT-5.4. The type of agent task determines which model to use.
Which model has better long-context performance?
GPT-5.5 at scale. MRCR v2 retrieval at 512K-1M tokens: GPT-5.5 at 74.0% versus Opus 4.7 at 32.2%. Both have a 1M token context window, but GPT-5.5 maintains retrieval accuracy across more of that window. For workloads that genuinely need to read and reason over hundreds of thousands of tokens, GPT-5.5 is the more reliable option above 128K.
Is GPT-5.5 available on the API yet?
Not fully. As of April 23, 2026, GPT-5.5 is available in ChatGPT (Plus, Pro, Business, Enterprise) and in Codex. The API rollout to Responses and Chat Completions is described as "very soon." Claude Opus 4.7 is GA on the Anthropic API, Amazon Bedrock, Google Vertex AI, and Anthropic Foundry.
Related Pages
- Claude Opus 4.7 for the complete Opus 4.7 capability and safety breakdown
- Claude Opus 4.7 vs Other Frontier Models for a five-model comparison including DeepSeek and Gemini
- Model selection guide for per-task switching inside Claude Code
- All Claude Models for the full Anthropic model timeline