GPT 5.5 beats Claude in terminal automation, long-context retrieval, and math benchmarks, but Claude still leads in software engineering and tool orchestration.
OpenAI’s most capable AI model was internally nicknamed “Spud.” That detail tells you something about how the team thinks about this launch. Not a revolution. A next step. A useful, reliable thing.
I went through the official GPT-5.5 announcement, the benchmark tables, and the independent analysis that followed the April 23 release. In my opinion, ChatGPT 5.5 is genuinely impressive in specific areas and genuinely behind Claude in others. The question of who is actually ahead doesn’t have one answer. It has five, depending on what you’re building.
Here’s what I found.
Key Takeaways
- GPT 5.5 is OpenAI’s first fully retrained base model since GPT 4.5.
- On Terminal-Bench 2.0, GPT 5.5 scored 82.7%, beating Claude Opus 4.7’s 69.4% by over 13 points.
- Claude Opus 4.7 still leads on SWE-bench Pro at 64.3% vs GPT 5.5’s 58.6%.
- GPT 5.5 is natively omnimodal, processing text, image, audio, and video in a single architecture.
- The API price doubled to $5/$30 per million input/output tokens.
- Available now on ChatGPT Plus, Pro, Business, and Enterprise. API went live on April 24.
What Is GPT 5.5?
GPT 5.5 is OpenAI’s flagship AI model released on April 23, 2026, designed for coding, reasoning, and agentic workflows.
OpenAI calls it their “smartest and most intuitive to use model yet”, and positions it as the next step toward an AI system that can take over complex, multi-step computer tasks without hand-holding.

Before I get into what changed, one thing worth knowing: this is the first fully retrained base model since GPT-4.5. GPT-5.1 through 5.4 were incremental updates layered on top of the same foundation. ChatGPT 5.5 is a different animal at the architecture level.
Where it fits in the model evolution
OpenAI’s release cadence has accelerated to roughly one flagship every seven weeks. GPT-5.4 shipped on March 5. GPT 5.5 arrived on April 23, 49 days later. And as per OpenAI, more are coming.
Greg Brockman, OpenAI’s co-founder and president, described ChatGPT 5.5 as a step toward a “super app” that combines ChatGPT, Codex, and an AI browser into a unified tool for enterprise users. That’s the longer-term vision. For now, what we have is a model that handles agentic coding, knowledge work, scientific reasoning, and computer use better than anything OpenAI has shipped before.
GPT 5.5 has two variants:
- GPT 5.5 (Standard): The flagship for everyday work such as coding, research, and writing.
- GPT 5.5 Pro: Built for high-value work. Legal research, advanced data science, etc., where accuracy is more important than speed.
Core capabilities at a glance
At a high level, GPT 5.5 is designed to:
- Handle messy and complex tasks without needing step-by-step instructions.
- Navigate across tools autonomously until a task is complete.
- Process text, images, audio, and video inside a single unified model.
- Operate within a 1M token context window (400K in Codex).
- Run in agentic terminal environments with planning, iteration, and self-correction.
That last point is where the GPT 5.5 benchmark story gets interesting.
GPT 5.5 Features: What Actually Changed?
Improvements in reasoning and accuracy
The most concrete improvement in GPT 5.5 is long-context retrieval. On OpenAI’s MRCR v2 benchmark at 512K to 1M token contexts, GPT 5.5 jumped from 36.6% (GPT-5.4) to 74.0%. That’s more than doubling. It’s not just a bigger context window. The model actually uses more of it accurately.
At 128K to 256K tokens, GPT 5.5 scores 87.5% vs. Claude’s 59.2%. For workflows involving full codebases, large document sets, or long conversation logs, a gap of 28.3 points matters in practice.
On advanced math, FrontierMath Tiers 1 to 3 reached 51.7%, ahead of Claude Opus 4.7’s 43.8%. GPT 5.5 Pro pushes this further to 52.4%.
Where it falls behind: on Humanity’s Last Exam without tools (pure academic knowledge), GPT 5.5 Pro scored 43.1%, trailing Claude Opus 4.7 at 46.9% and Mythos Preview at 56.8%. In the GPT 5.5 vs. Claude matchup, Claude still has an edge in zero-shot academic reasoning.
Agentic workflows and tool use
This is GPT 5.5’s strongest area, and the place where the generative AI race looks most different from how it did six months ago.
GPT 5.5 scored 84.9% on GDPval, a benchmark that tests agents across 44 real-world occupations from legal research to product management to financial analysis. On OSWorld-Verified, which measures whether a model can autonomously operate a real computer environment, it hit 78.7% against Claude’s 78.0%.
That’s essentially a tie on computer use. But on Tau2-bench Telecom, which tests complex customer service workflows, GPT 5.5 hit 98.0% without any prompt tuning. That’s a strong result for anyone building customer-facing agentic workflows.
One thing worth flagging is that on MCP-Atlas, Claude Opus 4.7 leads at 79.1% vs. GPT 5.5’s 75.3%. If your agents are coordinating multiple tools across a sprawling codebase, that gap isn’t trivial.
Speed and stability enhancements
ChatGPT 5.5 was co-designed with NVIDIA’s GB200 and GB300 NVL72 rack-scale systems. The practical result: per-token latency matches GPT-5.4 despite the model being significantly more capable.
That’s unusual. More capable models are usually slower. This one holds the line on speed.
There’s also a detail that got almost no attention in coverage: GPT 5.5 and Codex rewrote OpenAI’s own serving infrastructure before launch. That’s either a flex or a reliability story, depending on how you look at it. I’d argue it’s both. It also uses fewer tokens to complete the same tasks as GPT-5.4, which partially offsets the higher API price.

GPT 5.5 Benchmark Performance: A Closer Look
Before getting into the numbers, a caveat that I think matters: almost all of these benchmark results are vendor-reported by OpenAI. Independent replication is still catching up. I’ll note where third-party confirmation exists.
ChatGPT 5.5 vs. Claude on Terminal-Bench 2.0
Terminal-Bench 2.0 tests a model’s ability to navigate and complete tasks in a sandboxed terminal environment. The tests cover planning, iteration, and tool coordination in real CLI workflows.
Here’s the full comparison across the most relevant benchmarks:
| Benchmark | GPT 5.5 | Claude Opus 4.7 | Claude Mythos Preview |
| Terminal-Bench 2.0 | 82.7% | 69.4% | 82.0% |
| SWE-bench Pro | 58.6% | 64.3% | N/A |
| OSWorld-Verified | 78.7% | 78.0% | N/A |
| MCP-Atlas | 75.3% | 79.1% | N/A |
| GDPval | 84.9% | N/A | N/A |
| FrontierMath Tiers 1-3 | 51.7% | 43.8% | N/A |
| HLE (no tools) | 41.4% | 46.9% | 56.8% |
| MRCR v2 (1M tokens) | 74.0% | N/A | N/A |
| CyberGym | 81.8% | 73.1% | N/A |
Source: OpenAI official announcement, third-party verification via BenchLM and VentureBeat. All figures vendor-reported unless otherwise noted.
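To make the table easier to read head to head, here is a minimal sketch that computes the point gap between GPT 5.5 and Claude Opus 4.7 on every row where both have a score. The numbers are copied from the table above; the `SCORES` dictionary and the `head_to_head` helper are names I made up for illustration, and rows marked N/A are simply skipped.

```python
# Scores copied from the comparison table above, in percent.
# Tuple order: (GPT 5.5, Claude Opus 4.7). None marks an N/A cell.
SCORES = {
    "Terminal-Bench 2.0": (82.7, 69.4),
    "SWE-bench Pro": (58.6, 64.3),
    "OSWorld-Verified": (78.7, 78.0),
    "MCP-Atlas": (75.3, 79.1),
    "GDPval": (84.9, None),
    "FrontierMath Tiers 1-3": (51.7, 43.8),
    "HLE (no tools)": (41.4, 46.9),
    "MRCR v2 (1M tokens)": (74.0, None),
    "CyberGym": (81.8, 73.1),
}


def head_to_head(scores):
    """Return {benchmark: gap in points}, positive when GPT 5.5 leads."""
    gaps = {}
    for name, (gpt, opus) in scores.items():
        if gpt is None or opus is None:
            continue  # no head-to-head number for this row
        gaps[name] = round(gpt - opus, 1)
    return gaps


GAPS = head_to_head(SCORES)

# Print the rows sorted from GPT 5.5's biggest lead to Claude's biggest lead.
for name, gap in sorted(GAPS.items(), key=lambda kv: kv[1], reverse=True):
    leader = "GPT 5.5" if gap > 0 else "Claude Opus 4.7"
    print(f"{name:<24} {leader:<16} {abs(gap):.1f} pts")
```

On these vendor-reported figures, GPT 5.5 leads on four of the seven comparable rows and Claude Opus 4.7 on three, with OSWorld-Verified separated by less than a point.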

The Terminal-Bench 2.0 result is GPT 5.5’s most decisive win. 82.7% against Claude Opus 4.7’s 69.4% is not a marginal lead.
But the Mythos Preview comparison is where things get complicated. Claude Mythos Preview scored 82.0%, just 0.7 points below GPT 5.5. That’s effectively a tie. And Mythos Preview is classified by Anthropic as a strategic defensive asset. It’s not a product you can use. Against Mythos Preview, the “beat Claude” headline rests on a 0.7-point margin over a model that the public can’t access.
What these benchmark results actually mean
There are a few things I kept coming back to.
First, the SWE-bench Pro numbers deserve scrutiny. OpenAI noted that Anthropic’s 64.3% result may be affected by memorization on a subset of problems. If true, that narrows the gap. But OpenAI also self-designed GDPval, the benchmark where they score highest. Every lab leads with the benchmarks that favor them. The AI benchmark landscape has not gotten cleaner in 2026.
Second, counting state-of-the-art scores across 14 benchmark categories, GPT 5.5 holds the most, vs. 4 for Claude Opus 4.7 and 2 for Gemini 3.1 Pro. On breadth, GPT 5.5 has a real lead. On depth, the picture is messier.
GPT 5.5 vs. Claude: Who Is Actually Ahead?
Where GPT 5.5 leads
Based on benchmark numbers and real-world reports, ChatGPT 5.5 has a clear advantage in:
- Terminal-based agentic workflows: CLI automation, DevOps pipelines, unattended terminal agents.
- Long-context retrieval: The MRCR v2 jump from 36.6% to 74.0% is the largest single improvement in this release.
- Computer use: Near parity with Claude on OSWorld-Verified, with a slight edge.
- Cybersecurity tasks: 81.8% on CyberGym vs Claude’s 73.1%.
- Advanced math: FrontierMath Tiers 1-3 and Tier 4 both favor GPT 5.5.
- Speed at scale: Matches GPT-5.4 latency while being more capable.
One NVIDIA engineer with early access said losing access to GPT 5.5 “feels like I’ve had a limb amputated.” Strong words.
Where Claude still holds ground
Claude Opus 4.7 leads meaningfully, not marginally, in:
- Codebase-level software engineering: 64.3% vs 58.6% on SWE-bench Pro. For PR review, multi-language refactoring, and IDE-integrated coding, Claude still leads.
- Tool orchestration: 79.1% vs. 75.3% on MCP-Atlas benchmark. Anthropic’s model works better for large codebases with complex tool chains.
- Zero-shot academic reasoning: HLE without tools: 46.9% vs. 41.4%. For knowledge-intensive tasks that don’t need tool use, Claude has an edge.
- Output pricing: Claude Opus 4.7 costs $5/$25 per million tokens compared to GPT 5.5’s $5/$30, which is 17% cheaper on output.
- Multilingual Q&A: 91.5% for Opus 4.7 vs. 83.2% for GPT 5.5.
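The output-pricing point in the list above is worth checking against a real token mix, because the 17% discount applies only to output tokens. Here is a rough sketch using the list prices quoted in this article. The model keys are labels for this example, not official API identifiers, and the monthly workload is a made-up assumption.

```python
# Per-million-token list prices in USD, as quoted in this article.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "claude-opus-4.7": {"input": 5.00, "output": 25.00},
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the list-price cost in USD for a given token volume."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 50_000_000

gpt_cost = workload_cost("gpt-5.5", INPUT_TOKENS, OUTPUT_TOKENS)
opus_cost = workload_cost("claude-opus-4.7", INPUT_TOKENS, OUTPUT_TOKENS)

# Discount on output tokens alone vs. the blended saving for this mix.
output_discount = 1 - PRICES["claude-opus-4.7"]["output"] / PRICES["gpt-5.5"]["output"]
blended_saving = 1 - opus_cost / gpt_cost

print(f"GPT 5.5:          ${gpt_cost:,.2f}")
print(f"Claude Opus 4.7:  ${opus_cost:,.2f}")
print(f"Output discount:  {output_discount:.1%}")
print(f"Blended saving:   {blended_saving:.1%}")
```

For this input-heavy mix the blended saving is about 10%, not 17%. The closer your workload is to pure output, the closer the saving gets to the headline number.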
The routing logic, if you’re making a practical decision, is simple: terminal-first agents go to GPT 5.5; codebase-first agents stay on Claude. Both share the same 1M token context window. The differentiator is what each does at the top of that window.
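If you want that routing rule written down somewhere your team can argue about it, a lookup table is enough. The sketch below is only an illustration of the rule above: the workload tags, the `pick_model` function, and the model labels are all hypothetical names, and the mapping just mirrors the two “where each model leads” lists in this article.

```python
# Hypothetical labels for this sketch, not official API model identifiers.
GPT_55 = "gpt-5.5"
CLAUDE_OPUS_47 = "claude-opus-4.7"

# Workload tags are invented for illustration. The mapping follows the
# benchmark-based lists in this article, not any vendor recommendation.
ROUTES = {
    "terminal_agent": GPT_55,
    "pipeline_automation": GPT_55,
    "long_context_retrieval": GPT_55,
    "codebase_engineering": CLAUDE_OPUS_47,
    "tool_orchestration": CLAUDE_OPUS_47,
    "zero_shot_academic_qa": CLAUDE_OPUS_47,
}


def pick_model(workload: str) -> str:
    """Return the model label for a workload tag, or raise on unknown tags."""
    try:
        return ROUTES[workload]
    except KeyError:
        known = ", ".join(sorted(ROUTES))
        raise ValueError(
            f"unknown workload {workload!r}; expected one of: {known}"
        ) from None


print(pick_model("terminal_agent"))        # gpt-5.5
print(pick_model("codebase_engineering"))  # claude-opus-4.7
```

Failing loudly on an unknown tag is deliberate. A silent default would hide exactly the cases where you haven’t tested either model on your own workload yet.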
Real-World Impact of GPT 5.5
Productivity gains across use cases
Senior engineers who tested the model during preview said ChatGPT 5.5 was noticeably stronger than both GPT-5.4 and Claude Opus 4.7 on reasoning and autonomy. One tester gave it a comment system re-architecture task and returned to a nearly complete 12-diff stack.
The GDPval score of 84.9% is also meaningful for knowledge workers. That benchmark tests 44 real-world occupations. Finance, legal, product management, and data science are all covered. For teams using generative AI to automate routine knowledge work, GPT 5.5 has a real productivity story here.
Approximately 4 million developers were already using Codex weekly at launch. More than 85% of OpenAI employees use it across engineering, finance, communications, marketing, and product.
Where the impact is still limited
A few things to keep realistic about:
- Omnimodal maturity is uneven. Audio and video processing has been integrated, but early reports show rougher edges compared to text and image performance. The generative AI multimodal promise is a foundation, not a finished product.
- API access was delayed. API went live on April 24, one day after the announcement. This created friction for developers who wanted to test immediately.
- Free tier rollout has no date. GPT 5.5 is currently only available to Plus, Pro, Business, and Enterprise subscribers. If you’re on the free tier, you’re waiting.
Limitations of GPT 5.5
No GPT 5.5 benchmark breakdown is complete without this section. Here’s what I’d flag for anyone making a production decision:
- Pricing jumped hard. API pricing doubled from $2.50/$15 to $5/$30 per million input/output tokens. GPT 5.5 Pro goes to $30/$180. OpenAI argues that the model uses roughly 40% fewer tokens for the same work, which would make the effective price increase closer to 20% (2 × 0.6 = 1.2). That math requires real workload testing to verify.
- Benchmarks are still mostly self-reported. OpenAI published a detailed benchmark table and included categories where they trail, which is a reasonable signal of confidence. But independent replication on the most important evals is still in progress. The AI benchmark landscape rewards skepticism.
- SWE-bench Pro concerns. OpenAI flagged potential memorization in Anthropic’s 64.3% score. This is notable, but OpenAI also has every incentive to flag the one benchmark where they lose.
- The Mythos comparison is misleading as a headline. GPT 5.5 narrowly beats a model that isn’t publicly available. That’s a real data point, but calling it “beating Claude” overstates the competitive picture for users.
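The pricing bullet above is the one I’d test first, and the arithmetic is easy to rerun with your own numbers. This is a minimal sketch, assuming cost scales linearly with token count. The function names are mine, and the 40% token reduction is OpenAI’s claim, so treat it as an input to replace with what you measure.

```python
def effective_cost_multiplier(price_multiplier, token_reduction):
    """Cost relative to the old model for the same task.

    price_multiplier: new per-token price divided by the old price.
    token_reduction: fraction of tokens saved on the same task (0.40 = 40%).
    """
    return price_multiplier * (1.0 - token_reduction)


def break_even_reduction(price_multiplier):
    """Token reduction at which the new model costs the same as the old one."""
    return 1.0 - 1.0 / price_multiplier


# Output price went from $15 to $30 per million tokens (input doubled too).
PRICE_MULTIPLIER = 30.00 / 15.00

claimed = effective_cost_multiplier(PRICE_MULTIPLIER, 0.40)  # OpenAI's claim
measured = effective_cost_multiplier(PRICE_MULTIPLIER, 0.25)  # hypothetical
break_even = break_even_reduction(PRICE_MULTIPLIER)

print(f"At 40% fewer tokens: {claimed:.2f}x the old cost")
print(f"At 25% fewer tokens: {measured:.2f}x the old cost")
print(f"Break-even token reduction: {break_even:.0%}")
```

With a doubled price, the model has to use half as many tokens just to break even. If your workload only sees a 25% reduction, you’re paying 1.5x, not 1.2x.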
What GPT 5.5 Signals for the Future of AI Models
The roughly seven-week release cadence (March 5 to April 23) is the most significant signal from this ChatGPT update, separate from any individual GPT 5.5 benchmark number.
OpenAI’s chief scientist, Jakub Pachocki, said the last two years of AI progress have been “surprisingly slow.” That framing suggests the current acceleration is expected to continue, not slow down.
For the generative AI industry, the implication is that “current state-of-the-art” has a shorter shelf life than ever. The AI benchmark lead you held in April may be gone by June.
There’s also an infrastructure story happening underneath the benchmark competition. ChatGPT 5.5 and Codex rewrote OpenAI’s own serving stack before launch. Brockman is pushing toward a “super app.” Anthropic launched Claude Managed Agents into public beta on April 8. The competition is shifting from “which model scores higher” to “which ecosystem do you build inside?”
That’s a longer race, and benchmark scores don’t determine it.
Final Thoughts
GPT 5.5 is a real step forward. The Terminal-Bench 2.0 score of 82.7% is not a manufactured number; it reflects genuine improvement in agentic terminal work. The long-context retrieval jump is large. The omnimodal architecture is a legitimate structural upgrade.
But whether GPT 5.5 is ahead of Claude depends entirely on the task. For terminal-first agents, pipeline automation, and long-context retrieval, yes. For codebase-level engineering, tool orchestration, and zero-shot academic reasoning, Claude Opus 4.7 still leads.
The more useful frame is to stop asking which model is “better.” Start asking which model routes better to the tasks you actually run. In 2026, with GPT 5.5, Claude Opus 4.7, and Gemini 3.1 Pro all within benchmark striking distance of each other, that routing question is worth spending more time on than the headline score.
FAQs
GPT 5.5 scored 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7’s 69.4% by more than 13 points. This benchmark tests real command-line workflows, including planning, tool use, and iteration.
GPT 5.5 is priced at $5 per million input tokens and $30 per million output tokens, double GPT-5.4’s rates. GPT 5.5 Pro is $30/$180. OpenAI says token efficiency gains make the effective cost increase closer to 20% for agentic workloads.
Based on recent history: GPT-5.4 shipped March 5, GPT 5.5 on April 23, a gap of 49 days, or seven weeks. OpenAI has signaled that this cadence is expected to continue.
GPT 5.5 is not currently available on the free tier. It is available to ChatGPT Plus, Pro, Business, and Enterprise subscribers. There is no announced rollout date for the free tier.
All benchmark figures in this article are vendor-reported by OpenAI unless otherwise noted. Independent third-party verification is ongoing. Figures reflect the state of the AI benchmark landscape as of April 2026.

