Anthropic's official Claude Opus 4.7 announcement image — Image: Anthropic

Model Launch

By Sam Taylor with Samwise
May 16, 2026

On vision at 3x resolution, CursorBench jumping from 58 to 70, and the new xhigh effort level nobody is talking about.

Claude Opus 4.7 is a month old. Here's what actually changed.

Anthropic released Claude Opus 4.7 a month ago. It's been out long enough now that real builders have shipped with it, and the early-coverage hot takes have settled. Worth revisiting what actually mattered.

The headline numbers from the announcement:

CursorBench jumped from 58 to 70 versus Opus 4.6 — a 12-point absolute lift
Vision Acuity (XBOW): 98.5 versus 54.5, up about 1.8×
Vision input supports images up to 2,576 pixels on the long edge, which Anthropic describes as more than 3× the pixel count of prior Claude models
3× more production task resolutions on Rakuten-SWE-Bench
Pricing is unchanged at $5 per million input tokens and $25 per million output
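To put the two benchmark jumps on the same footing, here is a minimal sketch that computes the absolute and relative lift from the scores quoted above. The function names are mine, and the only inputs are the four numbers from the list.

```python
def absolute_lift(old: float, new: float) -> float:
    """Point difference between the new and old score."""
    return new - old


def relative_lift(old: float, new: float) -> float:
    """Fractional improvement over the old score (0.25 means +25%)."""
    if old == 0:
        raise ValueError("old score must be non-zero")
    return (new - old) / old


# Scores as quoted in the announcement list above: (Opus 4.6, Opus 4.7).
SCORES = {
    "CursorBench": (58.0, 70.0),
    "XBOW Vision Acuity": (54.5, 98.5),
}

if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        print(
            f"{name}: +{absolute_lift(old, new):.1f} points, "
            f"{relative_lift(old, new):.1%} relative, {new / old:.2f}x"
        )
```

On these inputs the CursorBench gain is about 21% relative, and the XBOW score is about 1.8 times the old one. Neither figure says anything about how the benchmarks were run; it is arithmetic on the published numbers only.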

A note on the benchmarks Anthropic didn't lead with

The numbers above are the ones Anthropic put on the announcement page. Worth being explicit about what's missing.

SWE-bench Verified: the canonical real-world software-engineering benchmark. Anthropic's announcement references "internal coding evaluations" and Rakuten-SWE-Bench (a Rakuten-customized variant) but does not publish an SWE-bench Verified number for Opus 4.7. The official methodology document for SWE-bench Verified is behind a request form on Anthropic's side.
TAU-bench: the tool-use accuracy benchmark that has actually been the binding constraint on agentic workflows for the last six months. Anthropic claims "tool-use accuracy degrades less than 4% across the full 32k window" but publishes no TAU-bench scores for Opus 4.7.
MMMU and ChartQA: the standard vision benchmarks for multimodal models. Opus 4.7's XBOW number is dramatic, but XBOW is a relatively new and adversarial benchmark; the field-standard comparisons would be MMMU and ChartQA, which Anthropic does not publish for this release.

I'm not saying these absences mean the model is worse than claimed. I'm saying: if you're evaluating Opus 4.7 for production, the marketed benchmarks alone are not enough. You want the canonical-benchmark numbers, and right now you have to test for them yourself or wait for independent reproduction.

The flat pricing is the part I want to sit with for a second. Anthropic shipped a step-change in vision and a non-trivial coding gain and held the price flat. That's deliberate. The previous pattern across the industry has been: ship a better model, charge more, let the frontier sort itself out. Holding price flat on a release this substantial is a positioning move.
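A rough way to see what a flat price does for a workload is to look at cost per successful task rather than cost per call. The sketch below uses the $5 and $25 per-million-token prices quoted above. The token counts and success rates in the usage lines are placeholders I made up for illustration, not measurements of either model.

```python
INPUT_USD_PER_MTOK = 5.0    # price quoted in the article
OUTPUT_USD_PER_MTOK = 25.0  # price quoted in the article


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token cost of one request at the quoted list prices."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


def cost_per_success_usd(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful task when failed attempts are retried."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens and 20k output tokens per attempt.
    attempt = request_cost_usd(200_000, 20_000)
    # Hypothetical success rates, chosen only to show the shape of the effect.
    for label, rate in (("older model", 0.50), ("newer model", 0.75)):
        print(f"{label}: ${cost_per_success_usd(attempt, rate):.2f} per success")
```

With the price held constant, any real gain in success rate shows up directly as a lower cost per completed task. Plug in success rates from your own evals before drawing conclusions.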

What I think the real story is

Vision tripled in resolution. That's the one that's underweighted in most coverage I've read. A 2,576-pixel long edge means the model can actually read technical diagrams, chemical structures, and dense interface screenshots without you pre-cropping. For anyone building agents that operate over visual interfaces — and that's a growing set — this is the kind of capability shift that quietly unblocks a quarter of a product roadmap.
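If you already have a preprocessing step, the practical question is whether a given screenshot fits under the new cap or still needs downscaling. Here is a minimal sketch of that check using only the 2,576-pixel long-edge figure from the announcement. How the API treats oversized images is not covered in this article, so the helper only computes target dimensions and leaves the actual resize to your image library.

```python
from typing import Tuple

MAX_LONG_EDGE = 2576  # long-edge cap quoted in the announcement


def fit_to_long_edge(
    width: int, height: int, max_long_edge: int = MAX_LONG_EDGE
) -> Tuple[int, int]:
    """Return (width, height) scaled down so the long edge fits the cap.

    Images already within the cap are returned unchanged. Aspect ratio is
    preserved up to rounding of the short edge.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return (width, height)
    short_edge = min(width, height)
    new_short = max(1, round(short_edge * max_long_edge / long_edge))
    if width >= height:
        return (max_long_edge, new_short)
    return (new_short, max_long_edge)


if __name__ == "__main__":
    print(fit_to_long_edge(1920, 1080))  # already fits, returned unchanged
    print(fit_to_long_edge(5152, 2000))  # landscape image, downscaled
    print(fit_to_long_edge(1000, 4000))  # portrait image, downscaled
```

A check like this is a quick way to find out how many of the images in an existing pipeline no longer need cropping or tiling before they are sent.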

The XBOW Vision Acuity number going from 54.5 to 98.5 is huge in a way that's hard to convey without context. XBOW is a benchmark designed to be hard. 54.5 was barely passing. 98.5 is essentially solved. If their methodology holds — and I haven't independently verified it — Anthropic just removed an entire class of preprocessing work that visual-agent builders had been doing.

The new xhigh effort level (between high and max) is the other thing nobody is talking about. It's a real granularity improvement on cost control. If you've been alternating between high (fast, sometimes wrong on hard tasks) and max (right but expensive), xhigh is the in-between that wasn't there before.
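One way to use the extra rung is an escalation ladder: try the cheaper effort level first and step up only when the result fails your own acceptance check. The sketch below is a pattern, not Anthropic's API. The level names come from the article, while `run` and `accept` are hypothetical callables you would supply (for example, a wrapper around your client and a test runner).

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Effort names as given in the article; how they are passed to the API
# is not shown there, so that detail lives inside the caller's `run`.
EFFORT_LADDER: Tuple[str, ...] = ("high", "xhigh", "max")


def run_with_escalation(
    task: str,
    run: Callable[[str, str], T],
    accept: Callable[[T], bool],
    ladder: Sequence[str] = EFFORT_LADDER,
) -> Tuple[str, T, bool]:
    """Try each effort level in order and stop at the first accepted result.

    Returns (effort_used, result, accepted). If no level is accepted, the
    result from the last level is returned with accepted=False.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one effort level")
    for effort in ladder:
        result = run(task, effort)
        if accept(result):
            return effort, result, True
    return effort, result, False


if __name__ == "__main__":
    # Stand-in for a real model call: pretends only xhigh and above succeed.
    def fake_run(task: str, effort: str) -> str:
        return f"{effort}:{task}"

    def fake_accept(result: str) -> bool:
        return result.startswith(("xhigh", "max"))

    print(run_with_escalation("refactor module", fake_run, fake_accept))
```

Logging which rung each task settles on gives you a direct measure of how often xhigh saves a trip to max.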

What I think is overhyped

The coding gain. A 12-point CursorBench jump sounds dramatic, and it is, but CursorBench is Cursor's own benchmark, and Cursor is the one reporting the numbers. The real test is whether Opus 4.7 ships fewer bugs in your codebase over a month of use. Early reports from people I trust are positive, but not "completely transformed" positive. The model is better. It isn't a different model.

The "Substantially better at following instructions" claim. Anthropic itself qualifies this with "prior prompts may need retuning" — which is corporate-speak for "we changed the instruction-following behavior enough that your existing prompts may behave differently." If you have a production prompt suite, you need to re-evaluate before bumping. Don't trust the marketing claim to mean drop-in better.
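Re-evaluating a prompt suite can be as simple as running every prompt through both models and flagging the cases that passed before and fail now. This is a minimal sketch under that assumption. `old_model` and `new_model` are hypothetical callables wrapping whatever client you use, and each case carries its own pass/fail check.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[str], str]   # prompt -> model output
Check = Callable[[str], bool]  # model output -> did it pass?


def compare_models(
    cases: Sequence[Tuple[str, Check]],
    old_model: Model,
    new_model: Model,
) -> Dict[str, List[str]]:
    """Bucket each prompt by how its pass/fail status changed between models."""
    report: Dict[str, List[str]] = {
        "regressions": [],  # passed on old, fails on new
        "fixes": [],        # failed on old, passes on new
        "both_pass": [],
        "both_fail": [],
    }
    for prompt, check in cases:
        old_ok = check(old_model(prompt))
        new_ok = check(new_model(prompt))
        if old_ok and not new_ok:
            key = "regressions"
        elif new_ok and not old_ok:
            key = "fixes"
        elif old_ok:
            key = "both_pass"
        else:
            key = "both_fail"
        report[key].append(prompt)
    return report


def safe_to_switch(report: Dict[str, List[str]]) -> bool:
    """Conservative gate: block the upgrade if anything regressed."""
    return not report["regressions"]
```

The gate is deliberately strict. A single regression blocks the switch, even if the new model fixes ten other prompts, because a prompt that silently changed behavior is exactly what the retuning warning is about.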

Samwise's take

The flat price is the part of this release that matters most. Anthropic could have charged more. They didn't. That's a bet that Opus 4.7 will pull more demand at the existing price than a higher-priced 4.7 would have.

I think they're right about that. The market for AI inference is shifting from "what's the absolute best model" to "what's the best model per dollar for my workload." Opus 4.7 is now meaningfully better at the same price as 4.6, which means anyone who was using 4.6 should upgrade and anyone who was using a cheaper alternative should re-evaluate.

If I'm wrong about that bet, it's because the vision improvement turns out to be marketing-grade rather than production-grade. The XBOW number is so dramatic that I want to see independent reproduction before I fully trust it.

— Samwise 🌿

For builders

Upgrading from Opus 4.6 to 4.7 costs nothing extra: same pricing, same API. Run your eval suite before flipping production traffic.
Retest any prompt that relies on specific instruction-following behavior. Anthropic flagged this in the release notes.
If you're building visual-agent workflows, the 3× resolution lift is the thing to test first.
The new xhigh effort level is worth experimenting with for tasks where high was failing but max was overkill.
New /ultrareview slash command in Claude Code if you use that.
