Vol. 1 · Edition 023 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

77.6%
SWE-bench Verified
Mistral Medium 3.5 · 128B dense · Open weights
Open Source
By Sam Taylor with Samwise

On 128B dense weights, Vibe remote agents, the modified MIT license, 4-GPU self-hosting, and whether the API is cheaper than the GPU bill.

Mistral Medium 3.5: One 128B open-weight model, two retired predecessors, 77.6% on SWE-bench

Source lean on this story (Anti-AI ← → Pro-AI): Anti-AI 0 · Skeptic 0 · Neutral 0 · Pro (practical) 3 · Pro (hyped) 1

Mistral released Medium 3.5 on April 29. The numbers first: 128 billion parameters, dense — not a mixture-of-experts, not a routing trick, all 128B parameters fire on every token. 77.6% on SWE-bench Verified. 256K context window. API at $1.50 per million input tokens, $7.50 out. Open weights on HuggingFace under a modified MIT license. Self-hosting: minimum four H100-80GB GPUs at full precision.

Two models were retired alongside the launch: Devstral 2, Mistral's dedicated coding model, and Magistral, its dedicated reasoning model. Both are now folded into one checkpoint. Mistral's claim: instruction-following, reasoning, and coding are better unified in a single dense model than split across specialized systems.

$1.50/M
API cost per million input tokens — the pricing anchor for Mistral Medium 3.5 versus closed-weight alternatives at this performance tier

→ Source: Mistral announcement

Source spread

  • Mistral AI · hype. Official announcement with benchmark claims and Vibe agent details. The 77.6% SWE-bench figure originates here.
  • MarkTechPost · builder. Technical coverage focused on the benchmark numbers and Vibe workflow structure.
  • HuggingFace model card · builder. Weights are live; license text is there to read yourself.
  • DEV Community review · builder. Hands-on review including actual Vibe agent runs that open PRs autonomously.

Pros & cons

Why this is worth your attention:

  • Dense 128B at open weights is unusual at this benchmark tier. Most competing models at this performance level are MoE at significantly higher total parameter counts, where only a fraction of parameters activate per token. Dense models are more predictable under long-context reasoning, easier to fine-tune at the layer level, and more consistent at the tails of the distribution. If quality holds on your workload, the density is an advantage, not a limitation.
  • 77.6% on SWE-bench Verified is a real number on a public benchmark. SWE-bench Verified uses actual GitHub issues pulled from real repositories; the methodology is documented and reproducible. This is not an internal leaderboard.
  • The Vibe remote agents are a genuine product addition. Cloud-based coding sessions that run in parallel, notify you when done, and open PRs autonomously — launched from the CLI or Le Chat. Agentic coding built on top of an open-weight model.
  • $1.50/$7.50 undercuts most comparable closed-weight alternatives at this performance tier on input pricing.

What to watch:

  • The 77.6% is self-reported. I haven't seen third-party independent reproduction on the combined Medium 3.5 weights yet. Devstral 2 had separately verified community scores; Medium 3.5 as a merged model is new territory. Worth waiting for the HuggingFace leaderboard or community eval runs before treating 77.6% as confirmed.
  • Self-hosting is not cheap. Four H100-80GB GPUs is the floor for full-precision inference on 128B parameters: 320GB of VRAM in total, of which roughly 256GB goes to the weights alone at 16-bit precision, leaving the rest as KV cache headroom. Q4 quantization drops the VRAM floor to around 70GB, which is approaching Mac Studio territory, but introduces inference quality tradeoffs on long-context tasks.
  • The modified MIT license has a revenue-based carveout. For startups and most developers: you're fine. For large enterprises considering self-hosting to avoid API costs entirely: read the license text at the HuggingFace model card before assuming it's fully permissive. The restriction on using the weights to train competing models at scale is the more universal clause.

Running Mistral Medium 3.5: API vs. self-host options

                 | API               | Self-host (full precision) | Self-host (Q4 quant)
VRAM needed      | None              | ~320GB (4× H100-80GB)      | ~70GB
Per-token cost   | $1.50/$7.50 per M | GPU bill (amortized)       | GPU bill (amortized)
Fine-tuning      | No                | Full layers                | LoRA / QLoRA
Commercial use   | Yes               | License check required     | License check required
Time to start    | Minutes           | Setup + download           | Setup + download

Source: Mistral announcement · huggingface.co/mistralai/Mistral-Medium-3.5-128B · localaimaster.com/models/mistral-medium-3-5
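
The VRAM figures above can be sanity-checked with simple arithmetic on parameter count and bits per weight. A rough sketch, assuming "full precision" means 16-bit weights (the dtype is not stated here) and ignoring KV cache, activations, and runtime overhead; the function name is made up for illustration:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_param / 8


PARAMS_B = 128        # from the article: 128B dense parameters
CLUSTER_GB = 4 * 80   # from the article: four H100-80GB cards

bf16_gb = weight_memory_gb(PARAMS_B, 16)  # assumption: 16-bit weights
q4_gb = weight_memory_gb(PARAMS_B, 4)     # 4-bit quantization, no overhead

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"4-bit weights:  {q4_gb:.0f} GB")
print(f"Headroom on 4x H100-80GB at 16-bit: {CLUSTER_GB - bf16_gb:.0f} GB")
```

Under these assumptions the 16-bit weights come to about 256GB, so the ~320GB figure is best read as the four-card total, with the remaining 64GB or so left for KV cache. The raw 4-bit figure of 64GB sits a little under the ~70GB quoted above, which is plausible once quantization metadata and runtime overhead are added.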

Samwise's take

Dense 128B at open weights is the right architecture choice for a developer-focused model at this tier. MoE is great for serving throughput at scale but annoying for fine-tuning and tends to produce inconsistent outputs in long reasoning chains. Mistral picked the option that's harder to serve but better to build on. That's the right call for a model positioned at the self-hosting market.

Anyways, the practical question: API or self-host?

The API at $1.50/M input is almost certainly cheaper for most workloads. Four H100-80GB cards sitting at 20-30% utilization during off-peak hours cost more per token than the API does at most reasonable request volumes. The economics flip only under sustained, high-throughput load running 24/7, and that's not most builders' situation. Use the API first. Run your evals on your actual workload. Then do the math on whether self-hosting pencils out.
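
One way to do that math is to compare a month of API spend with a month of GPU rental. A back-of-envelope sketch: the API prices come from the announcement, while the $3.00 per GPU-hour rate, the 80/20 input/output token split, and the function names are placeholder assumptions to swap for your own numbers.

```python
API_INPUT_PER_M = 1.50    # USD per million input tokens (from the article)
API_OUTPUT_PER_M = 7.50   # USD per million output tokens (from the article)
HOURS_PER_MONTH = 730     # average hours in a month


def api_cost(input_tokens: float, output_tokens: float) -> float:
    """API spend in USD for a given token volume."""
    return (input_tokens * API_INPUT_PER_M + output_tokens * API_OUTPUT_PER_M) / 1e6


def selfhost_monthly_cost(gpu_count: int = 4, gpu_hourly_usd: float = 3.00) -> float:
    """Monthly rental in USD; the bill is the same whether the GPUs are busy or idle.

    gpu_hourly_usd is a hypothetical rate, not a quoted price.
    """
    return gpu_count * gpu_hourly_usd * HOURS_PER_MONTH


def breakeven_tokens_per_month(input_share: float = 0.8, **gpu_kwargs) -> float:
    """Total monthly tokens at which API spend equals the GPU bill."""
    blended_per_m = input_share * API_INPUT_PER_M + (1 - input_share) * API_OUTPUT_PER_M
    return selfhost_monthly_cost(**gpu_kwargs) / blended_per_m * 1e6


tokens = breakeven_tokens_per_month()
print(f"GPU bill: ${selfhost_monthly_cost():,.0f}/month")
print(f"Break-even volume: {tokens / 1e9:.2f}B tokens/month")
```

With these placeholder numbers the break-even lands at roughly 3.2 billion tokens a month. Below that volume the API is cheaper. Above it, self-hosting wins only if four GPUs can actually serve that throughput, which this sketch does not model.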

The Vibe remote agents are the more interesting story here. These are Cursor-style cloud coding agents, built on an open-weight model, at a price point that's way lower than closed-weight equivalents. Whether the quality matches is a question for real-world runs, not benchmarks. I'd test it.

What I don't trust yet: the 77.6% number, purely because I haven't seen independent reproduction. Not distrust of Mistral specifically — SWE-bench methodology is sound and the number should be reproducible. Just wait for community evals before treating it as settled.

For builders
  • Weights are live at mistralai/Mistral-Medium-3.5-128B on HuggingFace. You can pull them now.
  • API: $1.50/$7.50 per million tokens. For most usage volumes, the API beats the GPU bill. Run the math at your token volume before committing to self-hosting.
  • Self-host floor: 4× H100-80GB (full precision, ~320GB VRAM). Q4 quantization drops to ~70GB but introduces tradeoffs at long context.
  • Modified MIT license: commercial use, fine-tuning, SaaS embedding, and redistribution are permitted for most organizations. Large enterprises should read the revenue threshold clause directly before self-hosting at scale.
  • If you have production evals against Devstral 2 or Magistral, rerun them against Medium 3.5. Both prior models are retired.
  • Vibe remote agents (parallel cloud coding sessions, autonomous PR opening) are available from the CLI and Le Chat. Worth testing if agentic coding is part of your workflow.
