On the Terminal-Bench 2.1 inversion, which workloads should stay on Opus, and the August 31 pricing cliff
Sonnet 5 beats Opus 4.8 where it matters most. The routing decision just changed.
Anti-AI 0 · Skeptic 1 · Neutral 0 · Pro (practical) 2 · Pro (hyped) 1 (← Anti-AI · Pro-AI →)
Anthropic shipped Claude Sonnet 5 on June 30 and made it the default model for Free and Pro plans on July 1. The marketing calls it "the most agentic Sonnet yet," which is the kind of phrase that requires a benchmark to mean anything.
The benchmark is Terminal-Bench 2.1. Sonnet 5 scores 80.4%. Opus 4.8, Anthropic's current flagship, scores 74.6%. That gap runs the wrong direction for the flagship. Not a rounding-error difference. 5.8 points, and it goes to the cheaper model.
I want to be precise about what this does and doesn't mean. Sonnet 5 is not globally better than Opus 4.8. It trails on SWE-bench Pro (63.2% versus 69.2%), which measures hardest-case single-pass coding. It trails on OSWorld-Verified (81.2% versus 83.4%), which matters for GUI agent workflows. Terminal-Bench 2.1 is one benchmark — and it's a first-party one. But it's also the benchmark that specifically measures sustained autonomous execution across multi-step terminal workflows, and that has been the binding constraint on real agentic pipelines.
| Benchmark / metric | Sonnet 5 | Opus 4.8 |
|---|---|---|
| Terminal-Bench 2.1 (autonomous execution) | 80.4% ✓ | 74.6% |
| SWE-bench Verified (software engineering) | 85.2% | — |
| SWE-bench Pro (hardest coding tasks) | 63.2% | 69.2% ✓ |
| OSWorld-Verified (GUI agents) | 81.2% | 83.4% ✓ |
| BrowseComp (agentic search, single-agent) | 84.7% | — |
| GDPval-AA v2 (knowledge work, Elo) | 1618 ✓ | 1615 |
| Input price per million tokens (intro) | $2.00 | $5.00 |
| Output price per million tokens (intro) | $10.00 | $25.00 |
| API model ID | claude-sonnet-5 | claude-opus-4-8 |
Source spread
- Anthropic — Introducing Claude Sonnet 5 — hype. Official launch post; benchmark numbers and System Card source from here.
- TechCrunch — Anthropic launches Sonnet 5 as a cheaper way to run agents — builder. Best framing on pricing implications and IPO context.
- MarkTechPost — Sonnet 5 vs Sonnet 4.6 vs Opus 4.8 — builder. Clearest side-by-side benchmark comparison across all three generations.
- VentureBeat — Anthropic launches Sonnet 5 at steep discount — skeptic. Makes the IPO-positioning read explicit; useful counterweight to the launch framing.
Pros & cons
What's real:
- Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark, per Anthropic's System Card.
- Introductory pricing at $2/$10 per million input/output tokens means Sonnet 5 costs 40% of Opus 4.8 through August 31, 2026. For agent workloads running thousands of calls, that's real money in the monthly bill.
- GDPval-AA v2 Elo of 1618 edges Opus 4.8's 1615. Knowledge-work tasks are the other benchmark category where the cheaper model actually outperforms.
- Drop-in replacement. Model ID is `claude-sonnet-5`. No API surface changes.
- Default model for Free and Pro users on claude.ai from July 1. Your users testing via claude.ai are already on Sonnet 5.
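To put the pricing point in bill terms, here is a minimal cost sketch. The per-million-token prices are the ones quoted in this piece ($2/$10 intro and $3/$15 standard for Sonnet 5, $5/$25 for Opus 4.8). The workload numbers (50,000 calls a month, 8,000 input and 1,500 output tokens per call) are made up for illustration, and I'm assuming the Opus 4.8 price doesn't move. Swap in your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Prices as quoted in this article; Opus 4.8 is assumed to stay unchanged.
PRICES = {
    "sonnet-5-intro": Price(2.00, 10.00),     # through August 31, 2026
    "sonnet-5-standard": Price(3.00, 15.00),  # after the intro window
    "opus-4.8": Price(5.00, 25.00),
}


def monthly_cost(price: Price, calls: int,
                 input_tokens_per_call: int,
                 output_tokens_per_call: int) -> float:
    """Return the monthly bill in USD for a fixed per-call token profile."""
    input_mtok = calls * input_tokens_per_call / 1_000_000
    output_mtok = calls * output_tokens_per_call / 1_000_000
    return (input_mtok * price.input_per_mtok
            + output_mtok * price.output_per_mtok)


# Hypothetical workload, for illustration only.
CALLS, IN_TOK, OUT_TOK = 50_000, 8_000, 1_500

costs = {name: monthly_cost(p, CALLS, IN_TOK, OUT_TOK)
         for name, p in PRICES.items()}

for name, cost in costs.items():
    share = cost / costs["opus-4.8"]
    print(f"{name:18s} ${cost:>9,.2f}  ({share:.0%} of Opus 4.8)")
```

Because both the input and output prices scale by the same factor, the share comes out at 40% of the Opus bill at intro pricing and 60% at the standard rate, whatever the token mix. Budget against the 60% line.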
What deserves scrutiny:
- SWE-bench Pro shows a real 6-point gap: 63.2% versus 69.2%. For the hardest single-pass software engineering tasks, Opus still leads. Don't route your hardest coding workloads to Sonnet 5 without running evals first.
- OSWorld-Verified trails (81.2% vs 83.4%). GUI-agent workflows should stay on Opus until you've validated on your specific task distribution.
- The intro pricing window closes August 31. At the standard $3/$15, Sonnet 5's discount narrows but doesn't disappear: it goes from 40% to 60% of Opus 4.8's listed $5/$25. Build your cost model against the standard price, not the intro one.
- Terminal-Bench 2.1 at 80.4% is Anthropic's benchmark. The methodology is documented but independent third-party reproduction hasn't appeared in the literature as of this writing. Treat the number as indicative, not settled.
- API model ID: `claude-sonnet-5`. Drop-in from Sonnet 4.6 or Sonnet 5 preview versions.
- Introductory pricing ($2/$10 per million tokens) runs through August 31, 2026. Standard rate: $3/$15. Run your eval suite and model cost math now.
- Multi-step terminal agent execution: test Sonnet 5 first. The Terminal-Bench 2.1 inversion (80.4% vs 74.6%) is the signal worth verifying against your workload.
- Hardest single-pass software engineering: Opus 4.8 still leads SWE-bench Pro (69.2% vs 63.2%). Don't flip without evals.
- GUI agent workflows: Opus leads OSWorld-Verified (83.4% vs 81.2%). Same caveat.
- Users testing on claude.ai are already on Sonnet 5 as of July 1. No action needed there.
- Check prompt compatibility. Sonnet 5 is a different model; don't assume instruction-following behavior is identical to Sonnet 4.6.
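The checklist above can be sketched as a small routing table. This is one way to encode it, not a recommendation from Anthropic: the `Workload` categories, the `pick_model` function, and the `eval_winners` override are all names I made up for illustration. Only the two model IDs come from the table earlier in this piece. The defaults are starting points to test, and your own eval results should override them.

```python
from enum import Enum
from typing import Mapping, Optional

SONNET = "claude-sonnet-5"
OPUS = "claude-opus-4-8"


class Workload(Enum):
    # Hypothetical categories mirroring the checklist above.
    TERMINAL_AGENT = "multi-step terminal agent execution"
    HARD_SINGLE_PASS_CODING = "hardest single-pass software engineering"
    GUI_AGENT = "GUI agent workflow"
    OTHER = "anything else"


# Starting points taken from the benchmark leads cited in this article.
DEFAULT_ROUTES = {
    Workload.TERMINAL_AGENT: SONNET,        # Terminal-Bench 2.1: 80.4% vs 74.6%
    Workload.HARD_SINGLE_PASS_CODING: OPUS,  # SWE-bench Pro: 69.2% vs 63.2%
    Workload.GUI_AGENT: OPUS,                # OSWorld-Verified: 83.4% vs 81.2%
}


def pick_model(workload: Workload,
               current_model: str,
               eval_winners: Optional[Mapping[Workload, str]] = None) -> str:
    """Choose a model ID for a workload.

    Results from your own evals (eval_winners) take priority. Workloads
    the checklist doesn't cover stay on whatever you run today.
    """
    if eval_winners and workload in eval_winners:
        return eval_winners[workload]
    return DEFAULT_ROUTES.get(workload, current_model)


# Example: default routing, then an override after running evals.
print(pick_model(Workload.TERMINAL_AGENT, current_model=OPUS))
print(pick_model(Workload.GUI_AGENT, current_model=OPUS,
                 eval_winners={Workload.GUI_AGENT: SONNET}))
```

Keeping the eval override separate from the defaults means a benchmark lead never silently outranks a result measured on your own task distribution.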
Further reading
- Anthropic — Introducing Claude Sonnet 5 — official launch, benchmark numbers, System Card notes
- TechCrunch — Anthropic launches Sonnet 5 as a cheaper way to run agents — pricing and IPO context
- MarkTechPost — Sonnet 5 vs Sonnet 4.6 vs Opus 4.8 benchmark comparison — side-by-side benchmark table
- Emergent.sh — Claude Sonnet 5 benchmark methodology — Terminal-Bench 2.1 methodology documentation
- VentureBeat — Anthropic Sonnet 5 at steep discount to flagship — IPO-positioning angle