Vol. 1 · Edition 023Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Previous

4.7

Now

4.8
shipped May 28, 2026 · $5 / $25 per 1M tokens
Model Launch
By Sam Taylor with Samwise

On the per-task token reduction, the fast mode price cut, and what fixed tool triggering means for long agentic loops.

Claude Opus 4.8 shipped yesterday. The efficiency gains matter more than the leaderboard.

Source lean on this story
▲ avg

Anti-AI

00

Skeptic

01

Neutral

00

Pro (practical)

02

Pro (hyped)

01

← Anti-AI · Pro-AI →

Anthropic released Claude Opus 4.8 on May 28. The marketing calls it "most capable generally available model to date," which is true almost by definition — every release gets that sentence — and also not the interesting part.

The interesting part is what the model does to your per-task cost. Artificial Analysis tested Opus 4.8 against the same agentic task set as 4.7 and found it completes the work in 15% fewer turns with 35% fewer output tokens. Those aren't benchmark numbers. Those are invoice numbers. At $25 per million output tokens, a 35% reduction in output per completed task is a meaningful cost shift — one you get without touching your code.

What actually changed

Two categories of change here. The model itself improved, and a set of API features shipped alongside the launch.

On the model:

Opus 4.8 scores 88.6% on SWE-bench Verified. On SWE-bench Pro (a harder eval using less publicly contaminated tasks), it hits 69.2% versus 4.7's 64.3%, a 4.9-point absolute gain. On Artificial Analysis's GDPval-AA real-world leaderboard, it sits at 1890 Elo — +137 over Opus 4.7 and +121 over GPT-5.5 Instant.

Anthropic also documents three behavioral improvements explicitly: better long-horizon agentic coding, fewer compaction-related derailments in long runs, and improved tool triggering. The tool triggering fix is the one I'd actually test. A model skipping a required tool call is a specific, reproducible failure mode that creates silent bugs in agentic systems. Anthropic says they've reduced that. It's an operational fix, not a benchmark gain.

On the API features:

Mid-conversation system messages are now GA with no beta header required. You can inject a role: "system" message after a user turn to update instructions mid-session without restating the full system prompt. For long agentic loops that accumulate context, this is real quality-of-life — it preserves cache hits on earlier turns instead of blowing them up.

The prompt cache minimum dropped to 1,024 tokens, lower than on 4.7. Prompts that were too short to cache on 4.7 now qualify with no code changes.

Refusal stop details are publicly documented. When Opus 4.8 declines a request, the stop_details object categorizes the type of refusal, which lets you route users to appropriate fallbacks instead of serving a generic error message.

On fast mode:

This one deserves its own paragraph. Opus 4.8's fast mode runs at 2.5x the standard output speed and is priced at $10/$50 per million tokens input/output. That's double the base rate. But the comparison that matters is to what came before: Opus 4.7 fast mode was $30/$150 per million — 6x standard pricing. The new fast mode is 3x cheaper than the old fast mode at the same speed.

3x
Cheaper: Opus 4.8 fast mode is $10/$50 per 1M tokens — down from $30/$150 on 4.7, same 2.5x speed

→ Source: BuildFastWithAI + Neowin

For any workload where fast mode on 4.7 was uneconomical — which was most of them, at 6x standard — 4.8's fast mode at 2x standard is a different conversation.

What's actually interesting vs. what's noise

The benchmark gains are real but they need context. 88.6% on SWE-bench Verified is impressive. It's also at the point in the curve where further absolute gains mean less than they used to. The remaining 11.4% is not uniformly distributed hard problems — it's the hardest problems in the set, and each additional point represents increasingly narrow capability edges. I'd weight GDPval-AA's real-work score more than SWE-bench for estimating production impact at this range.

The efficiency story is way more interesting than the leaderboard position. A 35% reduction in output tokens per task means the model finishes jobs with less work — either because 4.8 is genuinely better calibrated about when to stop, or because Anthropic tuned the effort levels in ways that score well on Artificial Analysis's eval set, or some of both. The way to find out is to run your actual workload.

I also want to flag one thing Anthropic does NOT change: temperature, top_p, and top_k parameters remain unsupported on Opus 4.8, same as 4.7. Setting these returns a 400 error. If your code from Opus 4.6 or earlier sets these explicitly, you need to remove them before migrating.

Opus 4.7 vs 4.8: what changed
FeatureOpus 4.7Opus 4.8
SWE-bench Pro64.3%69.2%
GDPval-AA Elo17531890 (+137)
Base pricing$5 / $25 per 1M tokens$5 / $25 per 1M tokens
Fast mode pricing$30 / $150 per 1M tokens$10 / $50 per 1M tokens
1M contextAPI onlyAPI + Bedrock + Vertex
Prompt cache minimumHigher (unspecified)1,024 tokens
Mid-convo system messagesBeta header requiredGA, no header
Tool triggeringKnown miss casesImproved (Anthropic-stated)

Source spread

Pros & cons

What's real:

  • The per-task efficiency gain (15% fewer turns, 35% fewer output tokens) is directly measurable in your token logs and affects your bill without code changes.
  • Fast mode at $10/$50 is now economically viable for latency-sensitive workloads where the old $30/$150 was not.
  • Mid-conversation system messages GA is a meaningful agent-architecture unlock. This removes real complexity from multi-step loops.
  • The 1,024-token cache minimum opens caching to a wider range of workloads automatically.

What deserves a side-eye:

  • Anthropic still doesn't publish a SWE-bench Verified score for Opus 4.7, which makes the 4.8 SWE-bench Verified number hard to contextualize against a true apples-to-apples prior. The SWE-bench Pro comparison (4.7: 64.3% → 4.8: 69.2%) is the more reliable delta because it has both numbers.
  • "Better tool triggering" is a qualitative behavioral claim in a changelog. Test it against your actual failing cases before treating it as solved.
  • The VentureBeat "near-Mythos level alignment" framing needs independent testing. Don't use it to justify reducing application-layer safety validation.
For builders
  • Upgrade is zero-cost: same $5/$25 per million token pricing as Opus 4.7. Run your eval suite before flipping production traffic.
  • Test tool triggering specifically if you've seen the "missed tool call" failure mode on 4.7. Anthropic's stated fix for a real operational problem.
  • Fast mode ($10/$50, 2.5x speed) is 3x cheaper than 4.7's fast mode — run the math for any latency-sensitive workflow that previously couldn't justify the premium.
  • Mid-conversation system messages are GA with no beta header required. Refactor any agentic loop that was restating the full system prompt mid-session to save cache hits.
  • Prompt cache minimum dropped to 1,024 tokens. Short prompts that previously couldn't create cache entries now qualify — check your cost dashboard after upgrading.
  • Remove any explicit temperature, top_p, or top_k parameters before migrating from Opus 4.6 or earlier; these return a 400 error on 4.7 and 4.8.

Further reading

🌿

Your take

How'd I do on this one?

What did I miss?

Tell Samwise (and Sam).

Disagree with the take? Spotted a fact I got wrong? Have context I should have included? Drop it here. Anonymous unless you leave an email.

Liked this? Get the weekly digest.

Free. Monday mornings. The week's stories, synthesized. Unsubscribe anytime.