
On vision at 3x resolution, CursorBench jumping from 58 to 70, and the new xhigh effort level nobody is talking about.
Claude Opus 4.7 is a month old. Here's what actually changed.
Anthropic released Claude Opus 4.7 a month ago. It's been out long enough now that real builders have shipped with it, and the early-coverage hot takes have settled. Worth revisiting what actually mattered.
The headline numbers from the announcement:
- CursorBench jumped from 58 to 70 versus Opus 4.6 — a 12-point absolute lift
- Vision Acuity (XBOW): 98.5 versus 54.5, a 44-point jump and roughly 1.8× the prior score
- Vision input supports images up to 2,576 pixels on the long edge, which is more than 3× the prior cap
- 3× more production task resolutions on Rakuten-SWE-Bench
- Pricing is unchanged at $5 per million input tokens and $25 per million output
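Since absolute and relative lifts are easy to blur together, here is a small sketch of the arithmetic behind the list above. The scores and prices are the ones quoted in this post; the token counts in the cost example are made up purely for illustration.

```python
# Figures quoted in the list above; the token counts further down are hypothetical.
CURSORBENCH_OLD, CURSORBENCH_NEW = 58.0, 70.0
XBOW_OLD, XBOW_NEW = 54.5, 98.5
INPUT_USD_PER_MTOK, OUTPUT_USD_PER_MTOK = 5.0, 25.0


def absolute_lift(old: float, new: float) -> float:
    """Difference in benchmark points."""
    return new - old


def relative_lift(old: float, new: float) -> float:
    """Fractional improvement over the old score (0.25 means +25%)."""
    return (new - old) / old


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the per-million-token prices above."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    cb_abs = absolute_lift(CURSORBENCH_OLD, CURSORBENCH_NEW)
    cb_rel = relative_lift(CURSORBENCH_OLD, CURSORBENCH_NEW)
    print(f"CursorBench: +{cb_abs:.0f} points, {cb_rel:.1%} relative")
    print(f"XBOW: {XBOW_NEW / XBOW_OLD:.2f}x the old score")
    # Hypothetical request: 20,000 input tokens and 2,000 output tokens.
    print(f"Example request: ${request_cost_usd(20_000, 2_000):.2f}")
```

The 12-point CursorBench lift works out to about 21% relative, and the XBOW score is about 1.8 times the old one, so "roughly doubled" is a generous rounding.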
A note on the benchmarks Anthropic didn't lead with
The numbers above are the ones Anthropic put on the announcement page. Worth being explicit about what's missing.
- SWE-bench Verified: the canonical real-world software-engineering benchmark. Anthropic's announcement references "internal coding evaluations" and Rakuten-SWE-Bench (a Rakuten-customized variant) but does not publish an SWE-bench Verified number for Opus 4.7. The official methodology document for SWE-bench Verified is behind a request form on Anthropic's side.
- TAU-bench: the tool-use accuracy benchmark that has actually been the binding constraint on agentic workflows for the last six months. Anthropic claims "tool-use accuracy degrades less than 4% across the full 32k window" but publishes no TAU-bench scores for Opus 4.7.
- MMMU and ChartQA: the standard vision benchmarks for multimodal models. Opus 4.7's XBOW number is dramatic, but XBOW is a relatively new and adversarial benchmark; the field-standard comparisons would be MMMU and ChartQA, which Anthropic does not publish for this release.
I'm not saying these absences mean the model is worse than claimed. I'm saying: if you're evaluating Opus 4.7 for production, the marketed benchmarks alone are not enough. You want the canonical-benchmark numbers, and right now you have to test for them yourself or wait for independent reproduction.
The flat pricing is the part I want to sit with for a second. Anthropic shipped a step-change in vision and a non-trivial coding gain and held the price flat. That's deliberate. The previous pattern across the industry has been: ship a better model, charge more, let the frontier sort itself out. Holding price flat on a release this substantial is a positioning move.
What I think the real story is
Vision tripled in resolution. That's the one that's underweighted in most coverage I've read. A 2,576-pixel long edge means the model can actually read technical diagrams, chemical structures, and dense interface screenshots without you pre-cropping. For anyone building agents that operate over visual interfaces — and that's a growing set — this is the kind of capability shift that quietly unblocks a quarter of a product roadmap.
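If you still want a client-side guard for images larger than the new cap, a minimal sketch looks like this. It only computes target dimensions; the 2,576 figure is the one quoted above, and `fit_long_edge` is a hypothetical helper of mine, not part of any SDK. Whether the API downsamples oversized images on its own is not something the announcement numbers here tell us, so treat this as a precaution.

```python
MAX_LONG_EDGE = 2576  # long-edge cap quoted in this post


def fit_long_edge(width: int, height: int, max_long_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_long_edge.

    Images already within the limit are returned unchanged; aspect ratio is
    preserved up to rounding to whole pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    # Multiply before dividing so exact ratios stay exact.
    new_width = max(1, round(width * max_long_edge / long_edge))
    new_height = max(1, round(height * max_long_edge / long_edge))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_long_edge(1920, 1080))  # already small enough, unchanged
    print(fit_long_edge(3840, 2160))  # a 4K screenshot, scaled to fit
```

A 1920×1080 screenshot passes through untouched, while a 3840×2160 one comes back as 2576×1449. You would pass the result to whatever image library you already use for the actual resize.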
The XBOW Vision Acuity number going from 54.5 to 98.5 is huge in a way that's hard to convey without context. XBOW is a benchmark designed to be hard. 54.5 was barely passing. 98.5 is essentially solved. If their methodology holds — and I haven't independently verified it — Anthropic just removed an entire class of preprocessing work that visual-agent builders had been doing.
The new xhigh effort level (between high and max) is the other thing nobody is talking about. It's a real granularity improvement on cost control. If you've been alternating between high (fast, sometimes wrong on hard tasks) and max (right but expensive), xhigh is the in-between that wasn't there before.
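One way to use the extra rung is an escalation ladder: start cheap and step up only when the answer fails a check. This is a sketch under loud assumptions. `run` and `accept` are hypothetical callables you would supply yourself; I am deliberately not showing a real API call, because the exact request parameter for effort is not covered in this post.

```python
from typing import Callable, Optional

# Ordering taken from this post: xhigh sits between high and max.
EFFORT_LADDER = ("high", "xhigh", "max")


def escalate(
    task: str,
    run: Callable[[str, str], str],
    accept: Callable[[str], bool],
    ladder: tuple[str, ...] = EFFORT_LADDER,
) -> tuple[Optional[str], Optional[str]]:
    """Try each effort level in order and return (level, answer) for the first accepted answer.

    `run(task, effort)` and `accept(answer)` are hypothetical callables, not SDK
    functions. Returns (None, None) if no level produces an accepted answer.
    """
    for effort in ladder:
        answer = run(task, effort)
        if accept(answer):
            return effort, answer
    return None, None


if __name__ == "__main__":
    # Stub standing in for a real model call: only xhigh and max get it right.
    def fake_run(task: str, effort: str) -> str:
        return "42" if effort in ("xhigh", "max") else "41"

    print(escalate("what is 6 * 7?", fake_run, lambda a: a == "42"))
```

The trade-off is that a failed attempt at a lower level is still paid for, so the ladder only saves money when most tasks pass before reaching max.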
What I think is overhyped
The coding gain. A 12-point CursorBench jump sounds dramatic, and it is, but CursorBench is Cursor's own benchmark, and Cursor is the one reporting the numbers. The real test is whether Opus 4.7 ships fewer bugs in your codebase over a month of use. Early reports from people I trust are positive, but not "completely transformed" positive. The model is better. It isn't a different model.
The "Substantially better at following instructions" claim. Anthropic itself qualifies this with "prior prompts may need retuning" — which is corporate-speak for "we changed the instruction-following behavior enough that your existing prompts may behave differently." If you have a production prompt suite, you need to re-evaluate before bumping. Don't trust the marketing claim to mean drop-in better.
- Upgrade from Opus 4.6 to 4.7 is free — same pricing, same API. Run your eval suite before flipping production traffic.
- Retest any prompt that relies on specific instruction-following behavior. Anthropic flagged this in the release notes.
- If you're building visual-agent workflows, the 3× resolution lift is the thing to test first.
- The new xhigh effort level is worth experimenting with for tasks where high was failing but max was overkill.
- New /ultrareview slash command in Claude Code if you use that.
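For the "run your eval suite first" advice, the check I would start with is a regression diff: which prompts passed on the old model and fail on the new one. A minimal sketch follows. `Case`, `find_regressions`, and the two model callables are hypothetical names of mine; in practice the callables would wrap your real client calls for 4.6 and 4.7.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def find_regressions(
    cases: list[Case],
    old_model: Callable[[str], str],
    new_model: Callable[[str], str],
) -> list[str]:
    """Return the prompts that pass on old_model but fail on new_model."""
    regressions = []
    for case in cases:
        passed_before = case.check(old_model(case.prompt))
        passed_after = case.check(new_model(case.prompt))
        if passed_before and not passed_after:
            regressions.append(case.prompt)
    return regressions


if __name__ == "__main__":
    # Stub models standing in for real API calls.
    def old(prompt: str) -> str:
        return prompt.upper()

    def new(prompt: str) -> str:
        return prompt.upper() if prompt != "reply in caps: b" else "b"

    suite = [
        Case("reply in caps: a", lambda out: out.isupper()),
        Case("reply in caps: b", lambda out: out.isupper()),
    ]
    print(find_regressions(suite, old, new))  # ['reply in caps: b']
```

Prompts that failed on both models are left out on purpose: they are existing problems, not something the upgrade introduced.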
Further reading
- Anthropic — Introducing Claude Opus 4.7 — official launch post, all benchmark numbers source from here
- CNBC — Anthropic rolls out Claude Opus 4.7 — business angle
- Fortune — Anthropic deepens Wall Street push — financial-services context that landed alongside Opus 4.7