Introducing GPT-5.5: 9 Key Benchmarks and API Facts

Last updated: April 2026 · Author: Ignacy Kwiecień · Reading time: ~12 min · Release-day coverage

Introducing GPT-5.5: the first fully retrained base model since GPT-4.5

GPT-5.5 is OpenAI’s April 23, 2026 release — the first fully retrained base model since GPT-4.5. Headline numbers: 82.7% on Terminal-Bench 2.0 (SOTA agentic coding), 84.9% parity with experts across 44 professions (GDPval), 39.6% on FrontierMath Tier 4 (vs 22.9% for Claude Opus 4.7). API: $5/$30 per 1M input/output tokens, 1M-token context window, same per-token latency as GPT-5.4.

Tags: OpenAI · GPT-5.5 · Frontier LLM · Agentic coding · Responses API

What is GPT-5.5, and why is this not just another point release?

GPT-5.5 is the model OpenAI released on April 23, 2026, and it is — by OpenAI’s own framing — the first fully retrained base model since GPT-4.5 (February 2025). That matters technically. Every version in between (5.0, 5.1, 5.2, 5.4) was a post-training or fine-tune on the GPT-5 base. GPT-5.5 is a new pre-training run: new data mix, new architecture decisions, a fresh capability frontier.

The model ships on three surfaces:

  • ChatGPT — Plus, Pro, Business, Enterprise. GPT-5.5 Pro is Pro/Business/Enterprise only.
  • Codex — OpenAI’s agentic coding CLI/IDE. GPT-5.5 becomes the default model.
  • API — Responses API and Chat Completions, with a 1M-token context window from day one.

Context for DTF readers: if you followed our Claude Opus 4.7 coverage, GPT-5.5 is the direct counter-release. On frontier math it outperforms Opus 4.7 by a wide margin; on pricing it is the more expensive option. The practical question is not “which is best” — it is “which is best per dollar for my workload,” and that depends on task mix.

What are the actual benchmark numbers for GPT-5.5?

OpenAI published three benchmark families covering agentic execution, knowledge work, and frontier reasoning.

| Benchmark | What it measures | GPT-5.5 result | Comparison |
|---|---|---|---|
| Terminal-Bench 2.0 | End-to-end agentic coding in a terminal | 82.7% | State-of-the-art; narrowly ahead of Claude Mythos Preview |
| GDPval | Knowledge work across 44 occupations (law, finance, PM, medicine) | 84.9% | Matches or beats industry professionals |
| FrontierMath Tier 1–3 | Research-level mathematics (authored by working mathematicians) | 52.4% (Pro) | Category leader |
| FrontierMath Tier 4 | Hardest tier, specifically designed to resist memorization | 39.6% (Pro) | vs 22.9% for Claude Opus 4.7 — nearly 2× |

Three readings of these numbers you won’t find in the press release:

  1. Terminal-Bench 2.0 is not LeetCode. It is an end-to-end agentic evaluation: the model has to decide when to run a test, when to read a log, when to stop. 82.7% on this suite means GPT-5.5 starts collapsing the dev → model → dev feedback loop into “dev reviews final patch,” which changes how you staff engineering teams, not just how you write prompts.
  2. GDPval at 84.9% is a warning shot for knowledge workers. This is not “AI replaces lawyers” — it is “the routine, benchmarkable deliverable in law, financial analysis, and product management is now commodity.” The remaining competitive edge shifts to what GDPval cannot measure: judgment, accountability, client relationships, timing.
  3. FrontierMath Tier 4 at 39.6% is genuinely new territory. As recently as 2024 every frontier model scored in the single digits on this benchmark. Doubling Claude Opus 4.7’s Tier 4 score suggests OpenAI invested heavily in reinforcement learning on multi-step mathematical reasoning, not just broader training.

How much does GPT-5.5 cost, and how does it compare?

The API price doubled compared to GPT-5.4, which is the clearest signal that OpenAI treats 5.5 as a generational step, not a refresh.

| Model | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|
| GPT-5.5 | $5 | $30 | 1M |
| GPT-5.5 Pro | $30 | $180 | 1M |
| GPT-5.4 (reference) | $2.50 | $15 | 400k |
| Claude Opus 4.7 (reference) | $15 | $75 | 1M |

For anyone running a production LLM app, this is an architecture decision, not a pricing footnote. If you move 100% of traffic from GPT-5.4 to GPT-5.5, your inference bill doubles. The cleaner answer is a cost-aware router: cheap intent classifier out front, GPT-5.4 for routine requests, GPT-5.5 for agentic/long-context, GPT-5.5 Pro for frontier reasoning.

Python — cost-aware router (production pattern)
from openai import OpenAI
from typing import Literal

client = OpenAI()

TaskType = Literal[
    "simple_qa",        # routine — stays on 5.4
    "long_document",    # needs 1M context — 5.5
    "agentic_coding",   # Terminal-Bench class — 5.5
    "research",         # frontier math/science — 5.5 Pro
    "knowledge_work",   # GDPval class — 5.5
]

def route_model(task: TaskType, context_tokens: int) -> str:
    """
    Route to the cheapest model that meets the capability bar.
    GPT-5.5 costs 2x GPT-5.4, and GPT-5.5 Pro costs 6x GPT-5.5.
    Only escalate when the ROI justifies it.
    """
    if task == "research":
        return "gpt-5.5-pro"     # $30 / $180 — frontier reasoning only
    if task in ("agentic_coding", "knowledge_work"):
        return "gpt-5.5"         # $5 / $30 — Terminal-Bench / GDPval class
    if task == "long_document" or context_tokens > 300_000:
        return "gpt-5.5"         # 1M context
    return "gpt-5.4"             # $2.50 / $15 — default

response = client.responses.create(
    model=route_model("agentic_coding", 50_000),
    input="Debug the failing integration test and fix the root cause.",
)
print(response.output_text)

What does a 1M-token context window actually unlock?

A 1M-token window is roughly 750k English words — about 1,500 A4 pages, comfortably more than the entire text of War and Peace. In practice, that changes four classes of workloads:

  • Codebase-aware agents. Most mid-sized open-source Python repos land at 200–500k tokens. GPT-5.5 can load the whole tree in a single call, rather than relying on a RAG step whose retrieval you have to debug.
  • Regulatory document analysis. The EU AI Act plus GDPR plus MiCA fit in a single context. You can ask about conflicts across the regime without stitching retrievals yourself.
  • Financial due diligence. A 10-K (500 pages) plus four quarters of earnings-call transcripts plus guidance fits in one shot. That was a multi-document RAG problem a year ago.
  • Literature and academic review. Several monographs at once, with genuine cross-referencing between them rather than stitched summaries.

Cost guardrail: 1M input tokens on GPT-5.5 costs $5 per call; on GPT-5.5 Pro, $30. A single errant full-context call is equivalent to a sit-down dinner. Put a hard input-token cap on your agent loop and log usage.input_tokens on every call — it’s the one line of observability that pays for itself in week one.
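That guardrail can be sketched as a pre-flight check before the expensive call. A minimal sketch, assuming the prices from the table above; the 4-characters-per-token estimate is a crude heuristic for English prose, not the official tokenizer, so treat the numbers as order-of-magnitude:

```python
# Illustrative price table (USD per 1M input tokens), from this article's pricing section.
PRICE_PER_M_INPUT = {"gpt-5.4": 2.50, "gpt-5.5": 5.00, "gpt-5.5-pro": 30.00}

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def check_input_budget(prompt: str, model: str, max_input_tokens: int = 300_000) -> float:
    """Refuse to send an oversized prompt; otherwise return the estimated input cost in USD."""
    tokens = estimate_tokens(prompt)
    if tokens > max_input_tokens:
        raise ValueError(f"~{tokens} input tokens exceeds the cap of {max_input_tokens}")
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# A full-context call on GPT-5.5 Pro: ~1M tokens of input is ~$30 before any output.
```

Run check_input_budget before every agent-loop call and reconcile the estimate against the exact usage.input_tokens that comes back in the response.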

GPT-5.5 vs GPT-5.5 Pro: when is the 6× price justified?

OpenAI positions two variants. The base GPT-5.5 is meant for most production work; GPT-5.5 Pro is reserved for frontier reasoning. The six-fold output price gap ($30 vs $180) is not subtle, and the decision deserves more than a gut call.

[Diagram: GPT-5.5 vs GPT-5.5 Pro — which variant to choose. GPT-5.5 ($5/$30 per 1M tokens, 1M context): production RAG, knowledge work, agents on a budget, chat apps, document analysis; Terminal-Bench 82.7%, GDPval 84.9%, latency matches GPT-5.4. GPT-5.5 Pro ($30/$180 per 1M tokens, 1M context; ChatGPT Pro only, 6× base output cost): scientific research, frontier math, hard debugging, legal analysis, multi-step reasoning; FrontierMath T4 39.6%, T1–3 52.4%. Source: openai.com/index/introducing-gpt-5-5 · 2026-04-23]

A simple ROI heuristic for GPT-5.5 Pro

Ask one question: what does a wrong answer cost? The Pro output premium is $150 per 1M tokens ($180 vs $30), so even a long 7,000-token answer costs only about $1 extra on Pro. If a bad legal summary, a bad trading call, or a missed bug in production code costs more than $200 to clean up, Pro pays for itself many times over. If the cost of a wrong answer is a minor retry, Pro is waste.
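Under stated assumptions (the $30 vs $180 output prices from the pricing table, and an error-rate reduction you must measure on your own traffic, not take on faith), the heuristic reduces to one inequality:

```python
BASE_OUTPUT_PER_M = 30.0   # GPT-5.5 output, USD per 1M tokens
PRO_OUTPUT_PER_M = 180.0   # GPT-5.5 Pro output, USD per 1M tokens

def pro_premium_usd(output_tokens: int) -> float:
    """Extra output cost of answering one query with Pro instead of base GPT-5.5."""
    return output_tokens / 1_000_000 * (PRO_OUTPUT_PER_M - BASE_OUTPUT_PER_M)

def pro_is_worth_it(error_cost_usd: float, error_rate_reduction: float,
                    output_tokens: int) -> bool:
    """Escalate when the expected cleanup saving exceeds the per-query premium."""
    return error_cost_usd * error_rate_reduction > pro_premium_usd(output_tokens)
```

Example: a $200 cleanup cost and a measured 10-point error-rate reduction dwarf the roughly $1 premium on a 7,000-token answer, so pro_is_worth_it(200, 0.10, 7_000) escalates; a $1 retry does not.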

What changed in GPT-5.5’s safety stack?

OpenAI shipped GPT-5.5 with what it describes as its “strongest set of safeguards to date” — the details live in the GPT-5.5 Deployment Safety Hub. The practical changes that matter for builders:

  • Full Preparedness Framework evaluation across biology, cyber, autonomy, and self-exfiltration tracks.
  • Internal and external red-teaming before release, with dedicated cyber and biology capability testing — OpenAI now assigns the model to higher risk tiers in those domains.
  • Feedback from ~200 trusted early-access partners on real deployment cases, not just synthetic red-team prompts.
  • Tightened controls on sensitive cyber requests, personal data handling, and repeated-misuse detection at the account level.

Translation for teams about to roll out GPT-5.5 in production: expect a higher refusal rate on edge queries than GPT-5.4. If your pipeline depended on the older model’s willingness to handle pen-testing questions, adversarial data, or certain categories of chemistry — your regression suite will light up. Run a 500-prompt A/B on representative production traffic before flipping the default.
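A minimal sketch of that replay comparison, assuming you have already collected both output sets. The REFUSAL_MARKERS list is a naive stand-in, not OpenAI’s refusal taxonomy; swap in your own classifier before trusting the rates:

```python
# Placeholder refusal markers; replace with a proper classifier in production.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm not able to help")

def is_refusal(text: str) -> bool:
    """Naive substring check for refusal phrasing."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def compare_refusals(prompts: list, old_outputs: list, new_outputs: list) -> dict:
    """Refusal rate per model, plus the prompts that only the new model refuses."""
    old = [is_refusal(t) for t in old_outputs]
    new = [is_refusal(t) for t in new_outputs]
    return {
        "old_rate": sum(old) / len(prompts),
        "new_rate": sum(new) / len(prompts),
        "newly_refused": [p for p, o, n in zip(prompts, old, new) if n and not o],
    }
```

The newly_refused list is what you triage by hand: refusals on genuinely sensitive prompts are expected, while false positives in your core flow are a launch blocker.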

Where does GPT-5.5 fit in the EU AI Act?

GPT-5.5 is a general-purpose AI model (GPAI) under Regulation (EU) 2024/1689. Given its capability profile (FrontierMath, Terminal-Bench) and the 10^25 FLOPs training threshold in Article 51, it almost certainly qualifies as GPAI with systemic risk (Articles 51–55).

OpenAI’s provider obligations include standardized evaluations (Art. 55(1)(a) — covered by the Deployment Safety Hub), systemic risk assessment and mitigation (Art. 55(1)(b) — covered by the Preparedness Framework), serious incident reporting to the AI Office (Art. 55(1)(c)), adequate cybersecurity, and a public training-data summary.

If you are the deployer — the company wiring GPT-5.5 into your product — your obligations depend on the use case:

  • High-risk systems under Annex III (credit scoring, hiring, education, justice administration): full Article 26 applies — human oversight, logging, continuous monitoring, a Fundamental Rights Impact Assessment (FRIA), and a Data Protection Impact Assessment (DPIA).
  • Transparency (Art. 50): users must know they’re interacting with an AI system.
  • Deepfake labeling (Art. 50(4)): generative output must be marked as AI-generated with a machine-readable signal.

US/UK readers: even if you don’t ship into the EU directly, the Act reaches you if your output is used in the EU. A US fintech shipping credit decisions through an EU partner inherits the deployer stack. This mirrors the extraterritorial pattern of GDPR.

Five workflows where GPT-5.5 meaningfully changes the job

1. Agentic coding in CI/CD

With 82.7% on Terminal-Bench 2.0, the “failing test → read log → propose fix → run tests → iterate” loop can run without a human in the loop for small bugs. In Codex CLI I’ve been running GPT-5.5 as a pre-PR debugger on feature branches; the conversion rate from auto-fix to merged PR is materially higher than with GPT-5.4.

2. Financial due diligence at 1M context

Load a 10-K, four earnings-call transcripts, and the forward guidance from a full year — roughly 600k tokens — and ask for inconsistencies between forward-looking statements and reported results. A year ago this was a multi-retrieval RAG problem with manual reconciliation. Now it’s one call, about $3 of input.

3. Knowledge work commoditization

GDPval at 84.9% is the clearest sign yet that routine knowledge-work deliverables are converging to commodity. For retail traders and junior analysts — we’ve written about this in our loss aversion piece — the edge shifts to what benchmarks cannot measure: conviction, timing, and accountability for outcomes.

4. Research assistance at frontier math

GPT-5.5 Pro at 39.6% on FrontierMath Tier 4 is no longer “a better calculator.” It becomes a collaborator for routine verification of proof steps in research mathematics. It won’t replace mathematicians, but it makes certain classes of verification work tractable in an afternoon instead of a week.

5. Regulatory and compliance review

For compliance teams in regulated industries, the 1M context window is the feature, not the benchmarks. Load your full compliance policy stack plus the latest regulatory text (EU AI Act + sector-specific guidance) and ask for conflicts. We cover the fintech use case in our AI credit scoring under the EU AI Act explainer.

Migrating from GPT-5.4: a practical checklist

Python — minimal GPT-5.5 call via Responses API
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Simple text call
response = client.responses.create(
    model="gpt-5.5",
    input="Explain the GDPval benchmark in three sentences.",
    max_output_tokens=500,
)
print(response.output_text)

# With agentic tools (function calling)
tools = [{
    "type": "function",
    "name": "get_stock_price",
    "description": "Return the current stock price for a ticker.",
    "parameters": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]

response = client.responses.create(
    model="gpt-5.5",
    input="What's the current price of AAPL?",
    tools=tools,
)

Six-step migration checklist from GPT-5.4 → GPT-5.5:

  1. Change model="gpt-5.4" to "gpt-5.5" on one endpoint as a canary. Don’t flip the whole stack at once.
  2. Log response.usage.input_tokens and output_tokens for a week before and after. Budget for a 2× cost increase on the canary path.
  3. Replay 500 representative production prompts and compare refusal_rate. Expect it to tick up; make sure the increase lands on genuinely sensitive prompts, not false positives in your core flow.
  4. If you use structured outputs (JSON schema), re-validate every schema against GPT-5.5 output. New base models have caught edge-case parsing regressions in past releases.
  5. For context windows above 300k tokens, explicitly test chunk ordering. The “lost in the middle” problem is reduced on GPT-5.5 but not eliminated — critical content should still land in the top or bottom 20% of the prompt.
  6. Monitoring alerts: per-token latency matches GPT-5.4, so latency SLOs can stay as-is. Update cost alerts to the new $5/$30 baseline.
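Steps 1–2 can be combined into a deterministic canary split plus a one-line usage log. A sketch under stated assumptions: the 5% default and the log format are illustrative choices, and the hashing trick exists only to guarantee that a given request_id always lands in the same bucket across retries:

```python
import hashlib

def canary_model(request_id: str, canary_pct: float = 5.0) -> str:
    """Route a fixed, sticky slice of traffic to GPT-5.5; the rest stays on 5.4."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "gpt-5.5" if bucket < canary_pct * 100 else "gpt-5.4"

def usage_log_line(request_id: str, model: str,
                   input_tokens: int, output_tokens: int) -> str:
    """One tab-separated line per call, enough to budget the 2x cost delta after a week."""
    return f"{request_id}\t{model}\tin={input_tokens}\tout={output_tokens}"
```

After each client.responses.create(...) call, write usage_log_line(request_id, model, response.usage.input_tokens, response.usage.output_tokens) to your log sink; a week of those lines is the before/after baseline the checklist asks for.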

FAQ

When was GPT-5.5 released?

OpenAI released GPT-5.5 on April 23, 2026. The ChatGPT rollout began the same day for Plus, Pro, Business, and Enterprise users; GPT-5.5 Pro reached Pro/Business/Enterprise accounts immediately. The API (Responses and Chat Completions) was available on day one.

How much does GPT-5.5 cost via the API?

GPT-5.5 is priced at $5 per 1M input tokens and $30 per 1M output tokens. GPT-5.5 Pro costs $30 / $180 per 1M input/output tokens. Both variants ship with a 1M-token context window. The base price is exactly double GPT-5.4’s $2.50/$15, which OpenAI frames as justified by the fresh pre-training run and the capability jump.

Is GPT-5.5 better than Claude Opus 4.7?

It depends on the workload. On FrontierMath Tier 4, GPT-5.5 Pro (39.6%) clearly beats Claude Opus 4.7 (22.9%). On Terminal-Bench 2.0, GPT-5.5 (82.7%) narrowly beats Claude Mythos Preview. Claude Opus 4.7 remains competitive on long-horizon conversational tasks and is cheaper than Pro ($15/$75 vs $30/$180). For frontier math, pick GPT-5.5 Pro; for pure agentic coding, base GPT-5.5 already leads. For long multi-turn conversations with tool use on a tighter budget, Opus 4.7 is still worth benchmarking against your actual traffic.

Can I use GPT-5.5 in the free tier of ChatGPT?

No. GPT-5.5 requires ChatGPT Plus ($20/month), Pro ($200/month), Business, or Enterprise. Free-tier users continue to get older models with rate limits. GPT-5.5 Pro is restricted to Pro/Business/Enterprise plans.

What does “first fully retrained base model since GPT-4.5” mean?

Between GPT-4.5 (February 2025) and GPT-5.5 (April 2026), OpenAI shipped GPT-5, 5.1, 5.2, and 5.4. Those were post-training iterations and fine-tunes on the GPT-5 base — not new pre-training runs. GPT-5.5 is a new base model: fresh pre-training from scratch, new data mix, new architectural choices. That’s why benchmarks jump in steps rather than linearly, and why the API price doubled.

Does the 1M context window apply in ChatGPT too, or only in the API?

In the API, yes — the full 1M context is available on day one. In the ChatGPT interface, the per-conversation limit is smaller and was not officially published at launch (historically OpenAI caps ChatGPT Pro at ~128k–200k tokens). To exercise the full 1M, you need the API or Codex.

Do I need a new contract with OpenAI because of the EU AI Act?

No — GPT-5.5 is available under existing OpenAI Terms of Service. But if you’re deploying it inside a high-risk system under Annex III of the AI Act (credit scoring, hiring, education, justice), you as the deployer must meet Article 26: human oversight, logging, monitoring, a Fundamental Rights Impact Assessment (FRIA), and a Data Protection Impact Assessment (DPIA). That obligation is independent of your contract with OpenAI and applies even to US/UK companies whose output reaches EU users.

Bibliography

Related DTF coverage: Claude Opus 4.7 explained · Claude Design explained · AI credit scoring under the EU AI Act · Loss aversion explained · Richard Thaler — behavioral economics
