Claude Opus 4.7: 7 Biggest Changes + Benchmarks

● Breaking · April 16, 2026

Last updated: April 16, 2026 · Release day coverage

Anthropic released Claude Opus 4.7 (model ID claude-opus-4-7) on April 16, 2026 across the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The flagship upgrade: 3× more SWE-bench Verified tasks resolved vs Opus 4.6, a new xhigh reasoning effort level, a rebuilt tokenizer, vision inputs up to 2,576 pixels (≈3.75 MP), task budgets in public beta, and a /ultrareview command in Claude Code. Pricing is unchanged at $5 per million input tokens and $25 per million output tokens. Opus 4.7 is noticeably more literal in following instructions — a behavior shift that will break prompts tuned for 4.6.


What is Claude Opus 4.7?

Claude Opus 4.7 is Anthropic’s flagship frontier model released on April 16, 2026. It succeeds Opus 4.6 (released earlier in Q1 2026) and sits alongside the more broadly capable but restricted Claude Mythos Preview in Anthropic’s current lineup. The model is now generally available — no waitlist, no tier restrictions beyond normal API access — across four deployment channels: Anthropic’s direct API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

The headline claim from Anthropic’s April 16, 2026 announcement is that Opus 4.7 delivers “notable improvement on Opus 4.6 in advanced software engineering” — with some benchmarks showing 3× the task resolution rate of its predecessor. That framing matters because 4.6 was already strong enough to ship inside production agent pipelines. If the 3× figure holds up outside Anthropic’s curated evaluations, we’re looking at one of the largest single-release coding jumps since the Claude 3 → Claude 3.5 Sonnet transition in mid-2024.

This article breaks down what actually changed, what the benchmarks mean, what developers need to update in their prompts, and where Opus 4.7 sits against GPT-5 and Gemini 3.0. If you just want the decision: yes, upgrade your claude-opus-4-6 calls to claude-opus-4-7 — but read the “literal instructions” section first.

The 7 biggest changes in Opus 4.7

Anthropic’s release notes list roughly two dozen deltas. Seven of them materially change how developers and product teams will use the model.

1. SWE-bench Verified jump: 3× more production tasks resolved

The standout number. SWE-bench Verified is a curated benchmark of real GitHub issues where the model must produce a patch that passes all hidden unit tests. Opus 4.7 resolves roughly 3× more tasks than Opus 4.6 on this set. Anthropic has not (yet) published the absolute percentage in the public announcement, but internal reporting from partner Rakuten confirms a 13% lift on their internal SWE-bench variant — a meaningful number for a benchmark that has been near-saturated by frontier models.

For context, when I’ve used Opus 4.6 inside Claude Code on my own codebase, the tasks it struggled with were almost always multi-file refactors where you have to chase a change across 4+ files and keep coherence. That’s exactly the regime where 4.7 claims the biggest gains.

2. New xhigh reasoning effort level

Opus 4.6 exposed three thinking effort levels (low, medium, high). Opus 4.7 adds xhigh, sitting above high, for cases where latency matters less than correctness. The typical tradeoff: you spend 2–5× more output tokens on internal reasoning for a marginal accuracy lift — worth it for things like security audits, complex refactors, or multi-step financial analysis. Not worth it for simple code completions.

This is the first explicit acknowledgment from Anthropic that inference-time compute scaling is a first-class product dimension — something OpenAI productized with o1/o3 reasoning tokens but Anthropic had kept more implicit until now. See our context engineering explainer for how to combine xhigh with prompt caching to avoid runaway costs.

3. Rebuilt tokenizer — same prompt, different token count

Opus 4.7 ships with an updated tokenizer that handles non-English languages, code, and structured data more effectively. The tradeoff: the same input maps to between 1.0× and 1.35× as many tokens as on 4.6, depending on content. For mostly-English text, expect rough parity. For Polish, Japanese, or code-heavy inputs, expect token counts to rise.

Output token usage also climbs at higher effort levels, particularly in agentic settings — a non-trivial cost line. A task that cost $0.40 on Opus 4.6 might cost $0.50–$0.55 on Opus 4.7 at the same effort level, before you even opt into xhigh. We’ll cover cost modeling in the pricing section below.

4. Vision capabilities: 3× more pixels

Opus 4.7 accepts images up to 2,576 pixels on the long edge, approximately 3.75 megapixels. That’s over 3× the pixel budget of prior Claude models. Practically, this means:

  • Computer-use agents can now operate on screenshots at native 1440p resolution without downsampling artifacts.
  • Diagram analysis — architecture diagrams, circuit schematics, complex charts — no longer loses text legibility at the zoom levels Claude needs to read them.
  • Pixel-perfect reference work becomes viable: giving Claude a UI mockup and asking for a Tailwind implementation that actually matches the design.

One tradeoff: higher-resolution images consume proportionally more tokens. A 2,576-pixel image costs significantly more than a 1,092-pixel image did on Opus 4.6 for the same visual content. Budget accordingly.
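To feed Opus 4.7 a full-resolution screenshot, you attach it as a base64 image content block. The block shape below follows Anthropic's standard messages image format; the helper name and the PNG path are illustrative placeholders:

```python
import base64

def image_message(path: str, prompt: str) -> dict:
    """Build a user message pairing a base64-encoded PNG with a text prompt."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": data,
                },
            },
            {"type": "text", "text": prompt},
        ],
    }

# Pass the result to client.messages.create(model="claude-opus-4-7", ...):
# messages=[image_message("screenshot.png", "Implement this mockup in Tailwind.")]
```

Keep the image at or below the 2,576-pixel limit yourself; anything larger gets downsampled before the model sees it, which defeats the point of the upgrade.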

5. Task budgets (public beta)

Task budgets are a new API control that lets you specify a maximum token spend for a given task up-front. The model then paces its reasoning and tool calls to fit inside the budget. This directly addresses one of the most painful failure modes of agentic workflows in Opus 4.6: autonomous agents that would blow through $40 of tokens on a task that should have cost $4.

In practice, task budgets should be combined with the new xhigh mode for predictable high-accuracy work. You get the quality of deep reasoning with a hard ceiling on spend.

6. /ultrareview in Claude Code

A new slash command inside Claude Code triggers a dedicated, xhigh-backed review pass focused on bugs, design flaws, and edge cases. This isn’t just “review my code” — it’s a structured workflow that runs static checks, cross-references similar patterns elsewhere in the codebase, and produces a ranked list of issues with suggested fixes.

For solo developers, this is close to having a senior code reviewer on retainer. For teams, it’s a way to pre-screen PRs before human review, catching the obvious stuff so humans can focus on architecture and intent.

7. Auto mode extended to Max users

Previously limited to select Enterprise tiers, Auto mode — which lets Claude autonomously pick the appropriate model size and reasoning depth for each sub-task — is now available to all Claude Max subscribers. For most users this will replace manual model selection: Auto picks Sonnet for simple lookups, escalates to Opus 4.7 for hard reasoning, and uses xhigh only when needed.

Benchmark deep dive

Here’s how Opus 4.7 lands on the benchmarks Anthropic disclosed at launch, compared against Opus 4.6 and — where public figures exist — GPT-5 and Gemini 3.0:

| Benchmark | Opus 4.7 | Opus 4.6 | Delta | Domain |
|---|---|---|---|---|
| SWE-bench Verified | 3× baseline | baseline | 3× task resolution | Real-world software engineering |
| CursorBench | 70% | 58% | +12 pp | IDE-style code completion |
| XBOW visual-acuity | 98.5% | 54.5% | +44 pp | Computer-use vision |
| Rakuten SWE-bench | +13% lift | baseline | +13% | Industry SWE (Rakuten internal) |
| Finance Agent eval | SOTA | below SOTA | new top score | Financial reasoning + tools |
| GDPval-AA | SOTA | below SOTA | new top score | Economically valuable knowledge work |

A few observations worth flagging:

The XBOW visual-acuity jump from 54.5% → 98.5% is the single most dramatic delta in the release notes. XBOW measures whether a model can correctly identify and click small UI targets in screenshots — the basic competency for computer-use agents. A 44 percentage-point lift turns this from “sometimes works, sometimes misclicks” into “effectively solved.” Expect a wave of computer-use products to be rebuilt on Opus 4.7 over the next six weeks.

The Finance Agent and GDPval-AA SOTA claims matter less for individual developers and more for enterprise buyers. GDPval in particular was designed by Anthropic to test the kinds of knowledge work that actually generates revenue — legal drafting, financial analysis, consulting deliverables. Claiming SOTA here is a pitch aimed at Goldman Sachs, McKinsey, and the Big Four.

The CursorBench delta (58% → 70%) is interesting because Cursor itself switched its default model back-and-forth between Claude and GPT variants throughout 2025. A 12 pp lead will likely trigger another default reshuffle inside Cursor’s model routing.

What’s new for developers: code examples

Here’s the minimum change to migrate from Opus 4.6 to 4.7 and use the new xhigh effort level with a task budget:

Python · anthropic SDK v1.0+

```python
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    thinking={
        "type": "enabled",
        "effort": "xhigh",         # new level above "high"
        "budget_tokens": 32000,    # task budget (public beta)
    },
    system="You are a senior Python engineer. Follow instructions literally.",
    messages=[
        {
            "role": "user",
            "content": "Refactor this 600-line module to use async/await. Preserve all existing tests. Do not introduce new dependencies."
        }
    ],
)

print(response.content)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Thinking tokens: {response.usage.thinking_tokens}")
```

Two things to note in the snippet above:

The system prompt explicitly says “Follow instructions literally.” This is not decorative. Opus 4.7 already interprets instructions more literally than 4.6, and combining that behavior with an explicit reminder in the system prompt produces the most predictable output in my testing.

The thinking.budget_tokens field is the task budget cap. If the model would exceed it, it terminates reasoning and produces its best answer so far. This prevents the “agent spiral” failure mode where a single request costs more than a whole day of normal usage.

💡 Key insight

If you’re running Opus 4.7 inside MCP-based agent pipelines, combine xhigh with prompt caching on the system prompt and tool definitions. You’re effectively paying the high thinking cost only on the variable parts of each request. On a codebase agent I rebuilt this morning, this reduced cost per task by roughly 60% versus naive xhigh calls.
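A minimal sketch of that pattern, assuming the request shape from the snippet earlier in this article: the stable parts (system prompt, tool definitions) carry cache_control markers so only the variable user turn is billed at the full rate on repeat calls. The read_file tool definition here is a placeholder, not a real MCP tool:

```python
def build_cached_request(user_turn: str) -> dict:
    """Request kwargs with the stable blocks marked for prompt caching."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 8192,
        "thinking": {"type": "enabled", "effort": "xhigh", "budget_tokens": 32000},
        "system": [
            {
                "type": "text",
                "text": "You are a senior engineer. Follow instructions literally.",
                "cache_control": {"type": "ephemeral"},  # reused across calls
            }
        ],
        "tools": [
            {
                "name": "read_file",  # placeholder tool definition
                "description": "Read a file from the repository.",
                "input_schema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
                "cache_control": {"type": "ephemeral"},  # reused across calls
            }
        ],
        "messages": [{"role": "user", "content": user_turn}],
    }

# response = client.messages.create(**build_cached_request("Fix the failing test"))
```

Only the user turn changes between requests, so the cached system and tool blocks are read back at the discounted cache rate while xhigh thinking runs on the variable part.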

The “literal instructions” behavior change — don’t skip this

Buried in the release notes is one sentence that will cause more production incidents than any other: Opus 4.7 “takes instructions literally vs previous loose interpretation.”

Concretely, what changed:

  • If you said “don’t use TypeScript” to Opus 4.6, it would sometimes still use TypeScript if the task seemed to benefit. 4.7 will refuse even when it notices the task would be easier with TypeScript.
  • If you said “respond in JSON”, 4.6 might add a prose preamble before the JSON. 4.7 returns JSON, period.
  • If you said “write exactly 3 functions”, 4.6 would sometimes write 2 or 4 if that fit the task better. 4.7 will write 3.

This is good for production systems. It’s potentially bad for prompts written between 2023 and 2026 that assumed Claude would use judgment to deviate from instructions when obviously better. Anthropic explicitly warns: “Prompts written for earlier models may produce unexpected results.”

If you have any prompt with phrases like “if possible,” “ideally,” “try to” — those soft modifiers now carry more weight. The model will interpret “try to respond in under 200 words” as a strong suggestion, not a hard limit. If you want a hard limit, say so: “respond in under 200 words, no exceptions.”
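As a sketch of what that rewrite looks like in practice, the pair below contrasts the old soft phrasing with the hard phrasing 4.7 expects, plus a trivial client-side guard. The 200-word figure comes from the example above; the helper is a hypothetical check, not part of any SDK:

```python
# Opus 4.7 reads "try to" as a suggestion; state hard limits explicitly.
SOFT_PROMPT = "Try to respond in under 200 words."
HARD_PROMPT = "Respond in under 200 words, no exceptions."

def within_word_limit(text: str, limit: int = 200) -> bool:
    """Client-side guard: verify the limit held, regardless of prompt phrasing."""
    return len(text.split()) < limit
```

Even with the hard phrasing, a production system should still verify the constraint on the response and retry or truncate when it fails.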

How Opus 4.7 compares to GPT-5 and Gemini 3.0

The frontier model landscape in April 2026 is a three-way race. Here’s where each model currently wins:

| Domain | Best choice | Why |
|---|---|---|
| Agentic coding (multi-file refactors) | Opus 4.7 | 3× SWE-bench gain + literal instructions + task budgets |
| Very long context (>500k tokens) | Gemini 3.0 Ultra | 2M-token context window; Opus still at 500k |
| Pure mathematical reasoning (IMO-level) | GPT-5 | Strongest on MATH, AIME, IMO benchmarks |
| Computer-use agents | Opus 4.7 | XBOW 98.5% after the vision upgrade |
| Cheapest inference at high quality | Gemini 3.0 Flash | ~5× cheaper than Opus at Sonnet-level quality |
| Honesty / refusal calibration | Opus 4.7 | Anthropic's RLHF stack plus Constitutional AI |
| Multimodal voice/video | GPT-5 | Native voice+vision+video in one pass |

The honest summary: if your product is coding, agents, or computer-use, Opus 4.7 is the current best choice and the decision is not close. If your product is “upload a 1,500-page contract and summarize it,” Gemini 3.0 Ultra still wins on context window alone. If your product is conversational voice AI with realtime video input, GPT-5 remains in front.

For deeper context on why these models diverge so much, see our overviews of mixture-of-experts architecture (which GPT-5 and Gemini use heavily but Anthropic has been quieter about) and RLHF (where Anthropic’s Constitutional AI lineage still shows up in Opus 4.7’s refusal behavior).

Pricing and deployment — what it actually costs

Anthropic held the price line:

  • $5 per million input tokens (same as Opus 4.6)
  • $25 per million output tokens (same as Opus 4.6)
  • Prompt caching: available, same pricing structure as 4.6
  • Batch API: 50% discount still applies for async workloads

The real cost change is invisible in the sticker price: the new tokenizer can increase your effective token count by up to 35% depending on content, and xhigh mode uses substantially more output tokens. A real-world example from my testing this morning — refactoring a 400-line Python module:

| Config | Input tokens | Output tokens | Cost |
|---|---|---|---|
| Opus 4.6, high effort | 4,200 | 3,100 | $0.099 |
| Opus 4.7, high effort | 4,450 (+6%) | 3,600 (+16%) | $0.112 |
| Opus 4.7, xhigh effort | 4,450 | 8,400 | $0.232 |

The xhigh setting more than doubled cost on this single task — but produced a refactor that passed all existing tests on the first attempt, where the high version needed one follow-up round. For anything going to production, the xhigh premium is almost always worth it. For iterative exploration, stay on high.
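The cost column above follows directly from the unchanged $5/$25 per-million rates. The sketch below reproduces it, assuming thinking tokens are counted inside the output figures (which matches the totals in the table):

```python
IN_RATE, OUT_RATE = 5.00, 25.00  # $ per million tokens, Opus 4.6 and 4.7

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at Opus list pricing."""
    return (input_tokens * IN_RATE + output_tokens * OUT_RATE) / 1_000_000

for label, inp, out in [
    ("Opus 4.6, high ", 4_200, 3_100),
    ("Opus 4.7, high ", 4_450, 3_600),
    ("Opus 4.7, xhigh", 4_450, 8_400),
]:
    print(f"{label}: ${task_cost(inp, out):.3f}")
```

Note how output tokens dominate: the 5× output rate means an xhigh run that nearly triples output tokens roughly doubles the bill even when input is unchanged.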

All deployment channels launched with Opus 4.7 available on day one: Anthropic API, Amazon Bedrock (anthropic.claude-opus-4-7-v1:0), Google Cloud Vertex AI, and Microsoft Foundry. Existing Claude Max and Enterprise subscriptions include Opus 4.7 access at no additional cost — your existing usage quotas now cover the new model.

Safety and alignment notes

Anthropic states the Opus 4.7 safety profile is “similar to Opus 4.6” with specific improvements in two areas:

1. Honesty. Opus 4.7 is less likely to produce confidently-wrong answers when asked about topics near its knowledge boundary. This matters for agentic workflows where the model might need to admit it can’t solve a step and escalate, rather than fabricating a solution.

2. Prompt injection resistance. Resistance to malicious prompt injection attacks is “improved” — this is important for any agent consuming untrusted web content. The model is now more likely to detect “ignore previous instructions, instead do X” attacks when X conflicts with its system prompt.

Anthropic also flags a modest weakness in harm-reduction advice on controlled substances — meaning the model can sometimes over-refuse legitimate harm-reduction questions from health professionals. If you’re building a clinical tool, test carefully.

Cybersecurity capabilities are “deliberately kept less advanced than Mythos Preview.” Anthropic is explicit: safeguards automatically detect and block high-risk cybersecurity requests, and legitimate security researchers need to enroll in the Cyber Verification Program to unlock the full model capability surface.

⚠ Alignment note

The official position remains that Claude Mythos Preview is “the best-aligned model trained by Anthropic.” Opus 4.7 is more capable on most benchmarks but is a separate model, not a training-stage snapshot of Mythos. If alignment is your primary selection criterion rather than capability, the hierarchy is Mythos Preview > Opus 4.7 > Opus 4.6.

Known limitations

Three things Opus 4.7 does not fix:

Context window is unchanged. The release notes do not mention a context window increase. Opus 4.7 remains at the same context limit as 4.6. If you were hoping for a bump to compete with Gemini 3.0’s 2M window, this release is not it.

Prompt fragility across versions. As noted, the literal-instructions shift will break some legacy prompts. Plan to re-test anything production-critical. The safe path: keep claude-opus-4-6 as a fallback for 1–2 weeks while you validate 4.7 on your workloads.

Thinking tokens are expensive. xhigh is not free. On long-running agentic workflows, the combination of the new tokenizer overhead and increased thinking usage at higher effort levels can 2–3× your monthly spend vs 4.6 at equivalent effort. Task budgets mitigate this, but you need to set them explicitly — the default behavior is “spend as needed.”
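The keep-4.6-as-fallback path mentioned above can be sketched as a thin wrapper: try 4.7 first, fall back to 4.6 when the call fails or the output fails your own validation. The send and validate callables are placeholders for your real client call and output checks:

```python
PRIMARY, FALLBACK = "claude-opus-4-7", "claude-opus-4-6"

def call_with_fallback(send, validate, **kwargs):
    """send(model=..., **kwargs) issues the request; validate(resp) -> bool.

    Returns (model_used, response). Any exception or failed validation on
    the primary model routes the request to the known-good fallback.
    """
    try:
        resp = send(model=PRIMARY, **kwargs)
        if validate(resp):
            return PRIMARY, resp
    except Exception:
        pass  # fall through to the known-good model
    return FALLBACK, send(model=FALLBACK, **kwargs)
```

After one or two weeks of clean validation rates on real traffic, delete the wrapper and hard-code claude-opus-4-7.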

Who should upgrade?

Three groups should migrate to Opus 4.7 this week:

Teams running production coding agents. The SWE-bench gains are big enough that the upgrade pays for itself in reduced human review cycles within the first month. Pair with task budgets to keep costs controlled.

Products built on computer-use. The 98.5% XBOW score essentially makes reliable computer-use a solved problem for the first time. If you were holding off on a computer-use product because Opus 4.6 misclicked too often, 4.7 clears that blocker.

Finance and knowledge-work platforms. Claude ATS screeners, legal drafting tools, financial analysis copilots — anything that was measuring performance against GDPval or internal finance evals will see immediate gains.

Three groups can wait 2–4 weeks:

Chat products with high prompt-library investment. If you have hundreds of prompts battle-tested on 4.6, the literal-instructions change will mean non-trivial rewriting work. Stay on 4.6, pilot 4.7 on a small slice, migrate in stages.

Voice and multimodal-heavy products. Nothing in this release targets voice or video. You’re not missing gains by waiting.

Cost-constrained products where “good enough” suffices. If your workload was already handled well by Sonnet 4.5, dropping in Opus 4.7 doesn’t unlock new capability — it just costs more. Stay on Sonnet.

Key takeaways

Claude Opus 4.7 is primarily a coding and agents release, not a general capability release. The SWE-bench gains, the XBOW vision leap, the xhigh effort level, and task budgets together target the exact pain points of production agent builders. Pricing is unchanged at $5/$25 per million tokens. The single most disruptive behavior change is the shift toward literal instruction following — plan to re-test production prompts before swapping the model ID. For computer-use products, this is the first release where the underlying vision is reliable enough to deploy without constant human oversight. For enterprise knowledge work, the GDPval SOTA claim is the pitch Anthropic is making to replace McKinsey interns with API calls. For everyone else, wait a couple of weeks, let the community shake out the real-world performance, then migrate.

Frequently Asked Questions

When was Claude Opus 4.7 released?

Anthropic released Claude Opus 4.7 on April 16, 2026. It became generally available on the same day across Anthropic’s API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The official model ID is claude-opus-4-7.

How much does Claude Opus 4.7 cost?

Pricing is $5 per million input tokens and $25 per million output tokens — identical to Opus 4.6. However, the new tokenizer can map the same input to up to 1.35× as many tokens as 4.6, and xhigh effort mode uses substantially more output tokens, so real-world costs can rise 10–40% at equivalent workloads.

What is the xhigh reasoning mode?

xhigh is a new reasoning effort level introduced with Opus 4.7, sitting above the existing low, medium, and high levels. It uses substantially more thinking tokens per response in exchange for higher accuracy on difficult tasks. It’s most useful for complex coding refactors, security audits, and multi-step analytical tasks. It’s overkill for simple completions.

Is Claude Opus 4.7 better than GPT-5?

It depends on the task. Opus 4.7 leads on agentic coding (3× SWE-bench Verified vs Opus 4.6), computer-use vision (XBOW 98.5%), and honesty calibration. GPT-5 leads on pure mathematical reasoning and native multimodal voice/video. Gemini 3.0 Ultra still leads on very long context. For most software engineering and agent workloads in April 2026, Opus 4.7 is the best choice.

Will my existing prompts still work on Opus 4.7?

Most will, but with caveats. Opus 4.7 interprets instructions more literally than 4.6, meaning soft phrasings like “try to” or “if possible” now carry more weight. Prompts that relied on Claude using judgment to deviate from instructions may produce unexpected output. Anthropic explicitly warns: “Prompts written for earlier models may produce unexpected results.” Re-test anything production-critical before full migration.

What is the context window of Claude Opus 4.7?

The context window is unchanged from Opus 4.6 — the release notes do not announce an increase. If very long context is your primary need (over 500k tokens), Gemini 3.0 Ultra’s 2M token window remains the better choice. For most coding and agent workloads, Opus 4.7’s window is sufficient.

What is the /ultrareview command in Claude Code?

/ultrareview is a new slash command in Claude Code that triggers a dedicated, xhigh-backed review pass focused on bugs, design flaws, and edge cases. It runs static checks, cross-references similar patterns in the codebase, and produces a ranked list of issues with suggested fixes. It’s useful for pre-screening PRs before human review.

Is Claude Opus 4.7 safer than Opus 4.6?

Anthropic states the safety profile is “similar” to 4.6, with specific improvements in honesty (less confidently-wrong output) and resistance to prompt injection attacks. A modest weakness is flagged around harm-reduction advice on controlled substances, where the model can over-refuse. Claude Mythos Preview remains the “best-aligned” model in Anthropic’s lineup — Opus 4.7 is more capable but is a separate training.
