How Much Better Is Opus 4.7 Than 4.6?



On April 16, 2026, Anthropic released Claude Opus 4.7 — its most capable generally available model to date. Just two months after Opus 4.6 landed in February, the new release brings targeted improvements in coding, vision, and agentic workflows while keeping API pricing identical. This post breaks down exactly what changed between the two models, where each one excels, and what developers should consider before migrating.

The Big Picture

Opus 4.7 is not a ground-up rearchitecture. It is an iterative refinement of the Opus 4.6 foundation, with the most significant gains concentrated in three areas: coding performance, visual understanding, and long-running agentic reliability. The model also ships with Anthropic's first production cybersecurity safeguards — a direct consequence of the company's Project Glasswing initiative and its decision to keep the more powerful Claude Mythos Preview under restricted access.

A useful mental model: low-effort Opus 4.7 delivers roughly the same quality as medium-effort Opus 4.6. The model is simply more efficient at allocating its reasoning capacity.

Coding: The Headline Improvement

Coding is where Opus 4.7 pulls furthest ahead of its predecessor. On CursorBench — a benchmark that tests real-world coding inside an IDE environment — Opus 4.7 scores 70% compared to Opus 4.6's 58%, a 12-point jump. On Anthropic's internal 93-task coding benchmark, the improvement is 13%, and the model solved four tasks that neither Opus 4.6 nor Sonnet 4.6 could handle at all.

SWE-bench Verified, one of the most widely tracked software engineering benchmarks, went from 80.8% (Opus 4.6) to 87.6% (Opus 4.7). On the harder multi-language variant SWE-bench Pro, scores jumped from 53.4% to 64.3% — leapfrogging both OpenAI's GPT-5.4 (57.7%) and Google's Gemini 3.1 Pro (54.2%).

Early-access partners reported concrete improvements: Warp confirmed the model passed Terminal-Bench tasks that previous Claude models had failed, including a tricky concurrency bug that Opus 4.6 could not crack. Factory observed a 10-15% lift in task success for their autonomous engineering agents, with fewer tool errors and more reliable follow-through on validation steps. Vercel noted that Opus 4.7 is more correct and complete on one-shot coding tasks and is more transparent about its own limitations.

Perhaps the most striking report came from a team that had Opus 4.7 autonomously build a complete Rust text-to-speech engine from scratch — neural model, SIMD kernels, and browser demo — and then verify its own output by feeding it through a speech recognizer.

Vision: The Biggest Leap

Vision is arguably the most dramatic improvement between the two models. Opus 4.7 accepts images up to 2,576 pixels on the long edge (approximately 3.75 megapixels), more than three times the resolution capacity of Opus 4.6, which topped out at around 1.15 megapixels.

The impact on downstream performance is massive. XBOW, which does autonomous penetration testing, reported visual-acuity scores jumping from 54.5% on Opus 4.6 to 98.5% on Opus 4.7. On visual navigation benchmarks without tools, Opus 4.7 scored 79.5% versus 57.7% for Opus 4.6. The OSWorld-Verified computer-use benchmark climbed from 72.7% to 78.0%.

For developers building anything that processes screenshots, UI mockups, diagrams, scanned documents, or charts, this upgrade alone may justify the migration. Where Opus 4.6 would blur or lose detail in dense visual content, Opus 4.7 can read small-font buttons, parse chart values, and extract table data from scanned documents with far greater accuracy.
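
If you want to control resolution deliberately rather than rely on any server-side downscaling, pre-sizing images client-side keeps uploads small and token costs predictable. Here is a minimal sketch using Pillow; the 2,576-pixel long-edge figure is the limit reported above, and the helper name is ours:

```python
from PIL import Image  # pip install Pillow

# Long-edge limit reported for Opus 4.7 in this post; adjust if the
# official docs specify a different value.
MAX_LONG_EDGE = 2576

def fit_to_long_edge(path: str, out_path: str) -> None:
    """Downscale an image so its longer side is at most MAX_LONG_EDGE px."""
    img = Image.open(path)
    long_edge = max(img.size)
    if long_edge > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / long_edge
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    img.save(out_path)
```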

New Effort Level: xhigh

Opus 4.7 introduces a new xhigh ("extra high") effort level that sits between the existing high and max settings. This gives developers finer control over the tradeoff between reasoning depth and response latency.

Anthropic recommends starting with high or xhigh for coding and agentic use cases. Claude Code now defaults to xhigh for all plans. The max effort level still yields the highest benchmark scores (approaching 75% on coding tasks), but xhigh offers a practical sweet spot for most production workloads where latency matters.
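
In the Python SDK, selecting an effort level might look like the sketch below. The `effort` field name and its placement are assumptions based on the levels described in this post, not a confirmed request shape; check the current API reference before relying on it.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# NOTE: the "effort" field name is an assumption based on the effort levels
# described in this post (low / medium / high / xhigh / max); consult the
# API reference for the exact request shape.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Refactor this module and add tests."}],
    extra_body={"effort": "xhigh"},  # hypothetical knob for reasoning depth
)
print(response.content[0].text)
```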

Instruction Following and Self-Verification

One of the more subtle but impactful differences: Opus 4.7 follows instructions more literally than Opus 4.6. Where the older model would sometimes loosely interpret or skip steps in a complex prompt, Opus 4.7 executes them to the letter.

This is a double-edged characteristic. For new prompts, stricter instruction following means more predictable outputs. Prompts tuned to Opus 4.6's looser interpretation, however, may produce different results after the switch. Anthropic explicitly recommends testing on representative traffic before moving production workloads.
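
A minimal harness for that kind of side-by-side test might look like this. The `claude-opus-4-7` identifier comes from this post; the 4.6 identifier below is an assumption following the same naming pattern.

```python
import anthropic

client = anthropic.Anthropic()

# "claude-opus-4-7" is the identifier given in this post; the 4.6 identifier
# is assumed to follow the same pattern.
MODELS = ["claude-opus-4-6", "claude-opus-4-7"]

# A handful of prompts sampled from real production traffic.
sample_prompts = [
    "Summarize the attached changelog in three bullet points.",
    "Extract every date mentioned in this contract clause: ...",
]

for prompt in sample_prompts:
    for model in MODELS:
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model} ---\n{resp.content[0].text}\n")
```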

The model also shows a new behavior: it actively devises ways to verify its own outputs before reporting back. Vercel noted that Opus 4.7 even runs proofs on systems code before starting work — behavior not seen in earlier Claude models. Hex reported that it correctly flags missing data instead of providing plausible-sounding but incorrect fallbacks, and resists dissonant-data traps that Opus 4.6 would fall for.

Task Budgets (Public Beta)

To manage costs associated with longer reasoning runs, the Claude API introduces "task budgets" alongside Opus 4.7. This feature lets developers set a hard ceiling on token spend for autonomous agents, ensuring a long-running debugging session or multi-step workflow does not result in an unexpected bill. This is particularly relevant because Opus 4.7's updated tokenizer can increase token counts by 1.0x to 1.35x for the same content.
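
The exact request shape for task budgets is not documented in this post, but the same guarantee can be approximated client-side today using the usage stats every response already returns. A sketch, with an illustrative loop structure:

```python
import anthropic

client = anthropic.Anthropic()

TOKEN_BUDGET = 200_000  # hard ceiling for one agent run; tune per workload
spent = 0
messages = [{"role": "user", "content": "Debug the failing integration test."}]

while spent < TOKEN_BUDGET:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
    )
    # Every response reports its own consumption, so the loop can stop
    # before a long-running session turns into an unexpected bill.
    spent += resp.usage.input_tokens + resp.usage.output_tokens
    if resp.stop_reason == "end_turn":
        break
    messages.append({"role": "assistant", "content": resp.content})
    # ... append tool results / the next user turn here ...
```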

Claude Code Enhancements

Within Claude Code, Opus 4.7 brings a new /ultrareview command. Unlike standard code reviews that focus on syntax errors, /ultrareview simulates a senior human reviewer, flagging subtle design flaws and logic gaps. Additionally, "auto mode" — where Claude makes autonomous decisions without constant permission prompts — has been extended to Max plan users.

Cybersecurity Safeguards

Opus 4.7 is the first Claude model to ship with automated detection and blocking for prohibited or high-risk cybersecurity uses. This is a direct result of Anthropic's Project Glasswing initiative and the decision to test these safeguards on a less capable model before eventually rolling them out to Mythos-class systems.

Security professionals who need Opus 4.7 for legitimate cybersecurity purposes — vulnerability research, penetration testing, red-teaming — can apply through Anthropic's new Cyber Verification Program.

Where Opus 4.6 Still Holds Up (or Where 4.7 Regresses)

Not everything improved. On Terminal-Bench 2.0, Opus 4.7 trails GPT-5.4 (69.4% versus 75.1%), and BrowseComp scores softened compared to Opus 4.6. If terminal-based coding or web-browsing workflows are central to your use case, these results are worth evaluating before you switch.

Multilingual performance improved only incrementally. Google's Gemini 3.1 Pro still leads on multilingual Q&A benchmarks.

The updated tokenizer is another consideration. While per-token pricing stays at $5 input / $25 output per million tokens, the same content can consume up to 35% more tokens under the new tokenizer. For high-volume applications, and especially those processing many images, actual costs may rise even though the unit price has not changed.
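
The SDK's token-counting endpoint makes this easy to measure on your own payloads before migrating. As above, the 4.6 identifier is assumed from the naming pattern:

```python
import anthropic

client = anthropic.Anthropic()

payload = [{"role": "user", "content": open("representative_prompt.txt").read()}]

# "claude-opus-4-7" is the identifier given in this post; the 4.6 identifier
# is an assumption.
for model in ["claude-opus-4-6", "claude-opus-4-7"]:
    count = client.messages.count_tokens(model=model, messages=payload)
    print(model, count.input_tokens)
```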

Benchmark Summary

| Benchmark | Opus 4.6 | Opus 4.7 |
| --- | --- | --- |
| CursorBench | 58% | 70% |
| SWE-bench Verified | 80.8% | 87.6% |
| SWE-bench Pro | 53.4% | 64.3% |
| XBOW visual acuity | 54.5% | 98.5% |
| Visual navigation (no tools) | 57.7% | 79.5% |
| OSWorld-Verified (computer use) | 72.7% | 78.0% |
| Terminal-Bench 2.0 | n/a | 69.4% (GPT-5.4: 75.1%) |

Pricing and Availability

Opus 4.7 is available today across Claude products, the Anthropic API (model identifier: claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry (including Microsoft 365 Copilot).

Pricing is identical to Opus 4.6: $5 per million input tokens and $25 per million output tokens, with up to 90% savings through prompt caching and 50% through batch processing. US-only inference is available at 1.1x pricing.
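
Prompt caching works the same way it did on Opus 4.6: mark the static prefix of a request with a cache_control breakpoint so repeat calls that share it read from cache at the reduced rate. A sketch:

```python
import anthropic

client = anthropic.Anthropic()

# A long, static prefix that many requests share (style guide, schema docs).
# Note that prefixes below a minimum token length are not cached.
LONG_STATIC_SYSTEM_PROMPT = open("team_style_guide.md").read()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            # Marks everything up to this point as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```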

Evolution at a Glance

```mermaid
graph LR
    A["Opus 4.6<br/>Feb 2026"] -->|"Iterative Upgrade"| B["Opus 4.7<br/>Apr 2026"]
    B --> C["Coding<br/>+12pts CursorBench<br/>+13% internal bench"]
    B --> D["Vision<br/>3x resolution<br/>98.5% acuity"]
    B --> E["Agentic<br/>xhigh effort level<br/>Task budgets"]
    B --> F["Safety<br/>Cyber safeguards<br/>Project Glasswing"]
    style A fill:#e8e8e8,stroke:#888,color:#333
    style B fill:#d4a574,stroke:#8b6914,color:#333
    style C fill:#c5e1a5,stroke:#558b2f,color:#333
    style D fill:#90caf9,stroke:#1565c0,color:#333
    style E fill:#ce93d8,stroke:#7b1fa2,color:#333
    style F fill:#ef9a9a,stroke:#c62828,color:#333
```

Should You Upgrade?

Upgrade if you rely on Claude for coding tasks, computer use, vision-heavy workflows, or long-running agentic processes. The improvements in these areas are substantial and come at no additional per-token cost.

Evaluate carefully if your prompts are heavily optimized for Opus 4.6's behavior. The stricter instruction following may require prompt adjustments. If you process large volumes of content, test the tokenizer's impact on your actual token consumption before switching production traffic.

Hold off if your primary workload is terminal-based coding (where GPT-5.4 currently leads) or multilingual content (where Gemini 3.1 Pro has an edge).

For most developers, Opus 4.7 is a clear step forward — more capable, more reliable, and more efficient at the same price point. The practical recommendation: start with xhigh effort, test your existing prompts on representative inputs, and monitor token usage for the first week after migration.


Sources: Anthropic official announcement, Anthropic model card, early-access partner testimonials, benchmark data from Vellum AI, NxCode, and VentureBeat.
