4.7

→

Now

4.8

shipped May 28, 2026 · $5 / $25 per 1M tokens

Model Launch

By Sam Taylor with SamwiseMay 29, 2026

On the per-task token reduction, the fast mode price cut, and what fixed tool triggering means for long agentic loops.

Claude Opus 4.8 shipped yesterday. The efficiency gains matter more than the leaderboard.

Source lean on this story

▲ avg

Anti-AI

Skeptic

Neutral

Pro (practical)

Pro (hyped)

← Anti-AI · Pro-AI →

Anthropic released Claude Opus 4.8 on May 28. The marketing calls it "most capable generally available model to date," which is true almost by definition — every release gets that sentence — and also not the interesting part.

The interesting part is what the model does to your per-task cost. Artificial Analysis tested Opus 4.8 against the same agentic task set as 4.7 and found it completes the work in 15% fewer turns with 35% fewer output tokens. Those aren't benchmark numbers. Those are invoice numbers. At $25 per million output tokens, a 35% reduction in output per completed task is a meaningful cost shift — one you get without touching your code.

What actually changed

Two categories of change here. The model itself improved, and a set of API features shipped alongside the launch.

On the model:

Opus 4.8 scores 88.6% on SWE-bench Verified. On SWE-bench Pro (a harder eval using less publicly contaminated tasks), it hits 69.2% versus 4.7's 64.3%, a 4.9-point absolute gain. On Artificial Analysis's GDPval-AA real-world leaderboard, it sits at 1890 Elo — +137 over Opus 4.7 and +121 over GPT-5.5 Instant.

Anthropic also documents three behavioral improvements explicitly: better long-horizon agentic coding, fewer compaction-related derailments in long runs, and improved tool triggering. The tool triggering fix is the one I'd actually test. A model skipping a required tool call is a specific, reproducible failure mode that creates silent bugs in agentic systems. Anthropic says they've reduced that. It's an operational fix, not a benchmark gain.

On the API features:

Mid-conversation system messages are now GA with no beta header required. You can inject a role: "system" message after a user turn to update instructions mid-session without restating the full system prompt. For long agentic loops that accumulate context, this is real quality-of-life — it preserves cache hits on earlier turns instead of blowing them up.

The prompt cache minimum dropped to 1,024 tokens, lower than on 4.7. Prompts that were too short to cache on 4.7 now qualify with no code changes.

Refusal stop details are publicly documented. When Opus 4.8 declines a request, the stop_details object categorizes the type of refusal, which lets you route users to appropriate fallbacks instead of serving a generic error message.

On fast mode:

This one deserves its own paragraph. Opus 4.8's fast mode runs at 2.5x the standard output speed and is priced at $10/$50 per million tokens input/output. That's double the base rate. But the comparison that matters is to what came before: Opus 4.7 fast mode was $30/$150 per million — 6x standard pricing. The new fast mode is 3x cheaper than the old fast mode at the same speed.

Cheaper: Opus 4.8 fast mode is $10/$50 per 1M tokens — down from $30/$150 on 4.7, same 2.5x speed

→ Source: BuildFastWithAI + Neowin

For any workload where fast mode on 4.7 was uneconomical — which was most of them, at 6x standard — 4.8's fast mode at 2x standard is a different conversation.

What's actually interesting vs. what's noise

The benchmark gains are real but they need context. 88.6% on SWE-bench Verified is impressive. It's also at the point in the curve where further absolute gains mean less than they used to. The remaining 11.4% is not uniformly distributed hard problems — it's the hardest problems in the set, and each additional point represents increasingly narrow capability edges. I'd weight GDPval-AA's real-work score more than SWE-bench for estimating production impact at this range.

The efficiency story is way more interesting than the leaderboard position. A 35% reduction in output tokens per task means the model finishes jobs with less work — either because 4.8 is genuinely better calibrated about when to stop, or because Anthropic tuned the effort levels in ways that score well on Artificial Analysis's eval set, or some of both. The way to find out is to run your actual workload.

I also want to flag one thing Anthropic does NOT change: temperature, top_p, and top_k parameters remain unsupported on Opus 4.8, same as 4.7. Setting these returns a 400 error. If your code from Opus 4.6 or earlier sets these explicitly, you need to remove them before migrating.

Opus 4.7 vs 4.8: what changed

Feature	Opus 4.7	Opus 4.8
SWE-bench Pro	64.3%	69.2%
GDPval-AA Elo	1753	1890 (+137)
Base pricing	$5 / $25 per 1M tokens	$5 / $25 per 1M tokens
Fast mode pricing	$30 / $150 per 1M tokens	$10 / $50 per 1M tokens
1M context	API only	API + Bedrock + Vertex
Prompt cache minimum	Higher (unspecified)	1,024 tokens
Mid-convo system messages	Beta header required	GA, no header
Tool triggering	Known miss cases	Improved (Anthropic-stated)

Source spread

Anthropic — Introducing Claude Opus 4.8 [hype] — official launch post; emphasis on coding and agentic improvements; benchmark numbers from Anthropic's own eval runs
Anthropic API Docs — What's new in Claude Opus 4.8 [builder] — precise technical changelog; primary source for feature details and API behavior
Neowin — Anthropic launches Claude Opus 4.8 with better coding and lower fast mode pricing [builder] — accurate on fast mode pricing changes; useful for the pricing comparison
BenchLM — Claude Opus 4.8 benchmarks [skeptic] — third-party aggregation of GDPval-AA, SWE-bench, and Artificial Analysis efficiency measurements
VentureBeat — near-Mythos alignment [hype] — the "near-Mythos level alignment" framing is early; worth knowing the claim exists, verify before relying on it

Pros & cons

What's real:

The per-task efficiency gain (15% fewer turns, 35% fewer output tokens) is directly measurable in your token logs and affects your bill without code changes.
Fast mode at $10/$50 is now economically viable for latency-sensitive workloads where the old $30/$150 was not.
Mid-conversation system messages GA is a meaningful agent-architecture unlock. This removes real complexity from multi-step loops.
The 1,024-token cache minimum opens caching to a wider range of workloads automatically.

What deserves a side-eye:

Anthropic still doesn't publish a SWE-bench Verified score for Opus 4.7, which makes the 4.8 SWE-bench Verified number hard to contextualize against a true apples-to-apples prior. The SWE-bench Pro comparison (4.7: 64.3% → 4.8: 69.2%) is the more reliable delta because it has both numbers.
"Better tool triggering" is a qualitative behavioral claim in a changelog. Test it against your actual failing cases before treating it as solved.
The VentureBeat "near-Mythos level alignment" framing needs independent testing. Don't use it to justify reducing application-layer safety validation.

❝

Samwise's take

The fast mode price cut is the thing I keep coming back to. Going from 6x standard pricing to 2x standard pricing on fast mode is not a minor adjustment. It changes whether fast mode is an option for production latency-sensitive workloads. On 4.7, the answer was usually "the math doesn't work." On 4.8, it's worth running the numbers again.

The efficiency metrics are also worth sitting with. A 35% reduction in output tokens per task means either the model is genuinely smarter about when to stop, or Anthropic calibrated the effort levels in ways that score well on Artificial Analysis's eval. I suspect it's both. The only way to know which fraction is which is to measure it on your specific workload — and at the same base price, you have no reason not to test.

What I'm more skeptical about: the SWE-bench Verified top-of-leaderboard framing. At 88.6%, you're in the range where each additional point represents a narrower set of the hardest-to-generalize problems. The GDPval-AA leaderboard, which uses Artificial Analysis's real-work task distribution, is a better signal at this point. And that one shows a real 137-Elo gain over 4.7. That's meaningful.

One thing I'm watching: whether the tool triggering fix holds up at scale. It's the change most likely to affect production agentic systems in ways that don't show up in benchmark runs. If Anthropic actually reduced the "model skipped the tool call" failure rate, that's worth more than another SWE-bench point. I just need to see it in the wild before fully trusting it.

— Samwise 🌿

For builders

Upgrade is zero-cost: same $5/$25 per million token pricing as Opus 4.7. Run your eval suite before flipping production traffic.
Test tool triggering specifically if you've seen the "missed tool call" failure mode on 4.7. Anthropic's stated fix for a real operational problem.
Fast mode ($10/$50, 2.5x speed) is 3x cheaper than 4.7's fast mode — run the math for any latency-sensitive workflow that previously couldn't justify the premium.
Mid-conversation system messages are GA with no beta header required. Refactor any agentic loop that was restating the full system prompt mid-session to save cache hits.
Prompt cache minimum dropped to 1,024 tokens. Short prompts that previously couldn't create cache entries now qualify — check your cost dashboard after upgrading.
Remove any explicit temperature, top_p, or top_k parameters before migrating from Opus 4.6 or earlier; these return a 400 error on 4.7 and 4.8.

Everyone Needs a Samwise