Vol. 1 · Edition 025 · Free · No paywall

Everyone Needs a Samwise

AI news · Synthesized · Opinionated · 🌿

Paper claims: 3-bit · Production reality: 4-bit
Paper
By Sam Taylor with Samwise

On Zandieh et al.'s ICLR 2026 paper, the Red Hat AI / vLLM May 2026 accuracy study, and which configuration actually holds in production

Google's KV cache paper got one number wrong. The corrected version is still worth running.

Source lean on this story (← Anti-AI · Pro-AI →): Anti-AI 00 · Skeptic 01 · Neutral 02 · Pro (practical) 02 · Pro (hyped) 00

The KV cache is boring right up until it isn't.

You're serving a 70B model at 32k context. Traffic spikes. The H100 allocation hits the ceiling. Someone on your team digs up a Google Research preprint from April 2025 (TurboQuant, arXiv:2504.19874, accepted at ICLR 2026) claiming 5× compression with near-zero accuracy loss at 3 bits per coordinate. The authors are Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni across Google Research, NYU, and Google DeepMind. Open-source implementations are landing in llama.cpp, and vLLM PR #38479 is merged.

So you dig in. And the first thing you find is that the 3-bit claim is where the paper and production diverge.

What TurboQuant actually does

The algorithm combines two ideas. PolarQuant applies a random rotation to key-value vectors before scalar quantization. Quantized Johnson-Lindenstrauss (QJL) adds a 1-bit residual correction on top. Together: KV cache compressed to 3 bits per element versus the standard 16-bit FP16, which is where the 5× headline comes from (16 / 3 ≈ 5.3). The math is genuinely sound. The paper proves the method's distortion sits within a factor of 2.7 of the information-theoretic minimum for any given bit budget. Near-optimal by a rigorous definition.
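To see why rotating before quantizing helps, here is a minimal sketch in plain Python. It is not the paper's implementation: the uniform quantizer, the Gram-Schmidt rotation, the 64-dimensional vector, and the single outlier channel are all assumptions chosen for illustration. It compares the 3-bit reconstruction error of a vector quantized directly against the same vector quantized after a random rotation.

```python
import math
import random

def random_rotation(dim, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along earlier rows
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows

def matvec(mat, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def quantize_uniform(vec, bits):
    """Snap each coordinate to one of 2**bits evenly spaced levels in [-max|x|, max|x|]."""
    scale = max(abs(x) for x in vec) or 1.0
    top = 2 ** bits - 1
    codes = [round((x / scale + 1.0) / 2.0 * top) for x in vec]
    return [(c / top * 2.0 - 1.0) * scale for c in codes]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
dim, bits = 64, 3
# Hypothetical key vector: small values plus one large outlier channel.
key = [rng.gauss(0.0, 0.1) for _ in range(dim)]
key[7] = 10.0

rotation = random_rotation(dim, rng)
plain = quantize_uniform(key, bits)
rotated = quantize_uniform(matvec(rotation, key), bits)
# The inverse of an orthogonal matrix is its transpose.
restored = matvec(transpose(rotation), rotated)

err_plain = squared_error(key, plain)
err_rotated = squared_error(key, restored)
print(f"3-bit error without rotation: {err_plain:.2f}")
print(f"3-bit error with rotation:    {err_rotated:.2f}")
```

Without rotation, the outlier sets the quantizer's range and every small coordinate lands on a coarse level far from its true value. The rotation spreads that energy across all coordinates, so the same 3 bits cover a much tighter range.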

The operational advantage is real too. Data-oblivious: no calibration dataset, no fine-tuning, no model-specific training. Load any GGUF-format transformer and the compression applies immediately.

The problem is the QJL residual step at low bit widths. At 3 bits, the residual correction introduces error that compounds across long attention sequences. Red Hat AI published a comprehensive evaluation on May 11, 2026, by Eldar Kurtić, Michael Goin, and Alexandre Marques, testing Llama-3.3-70B, Qwen3-30B-A3B, and MiniMax-M2.7 across five benchmarks: MRCR for long-context retrieval, AIME25 and GPQA for reasoning, plus MATH500 and LiveCodeBench-v6.

The finding: 3-bit with QJL drops 15–25 points on reasoning at 128K+ context.

15–25 pts
Accuracy drop on reasoning tasks (AIME25/LiveCodeBench) at 3-bit with QJL, 128K+ context

→ Source: Red Hat AI / vLLM, May 2026

That's not "slightly worse." That's materially worse on the benchmarks that most resemble what LLM-serving customers increasingly run in production. Math tutoring. Code generation. Legal document analysis. These are reasoning tasks at long context. This is where the 3-bit claim breaks.

What the production configuration actually is

The Red Hat recommendation: use 4-bit without QJL. Specifically, the _nc variant (norm correction, QJL disabled), with the first two and last two attention layers kept at full FP16 precision. That delivers 3–4× memory reduction, not 5×, with accuracy near full precision on both reasoning and long-context tasks.
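A back-of-envelope check shows why 4-bit with four protected layers lands between 3× and 4× rather than at a clean 4×. The sketch below assumes the Llama-3.3-70B shape from its public config (80 layers, 8 KV heads, head dimension 128) and ignores the per-vector norm or scale values a real implementation also stores, so treat the ratios as upper bounds.

```python
# Assumed model shape (Llama-3.3-70B public config): 80 layers, 8 KV heads, head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
FP16_BITS = 16

def kv_bytes_per_token(bits_per_layer):
    """Bytes of K and V stored per token, given the bit width used in each layer."""
    values_per_layer = 2 * KV_HEADS * HEAD_DIM  # keys + values
    return sum(values_per_layer * bits / 8 for bits in bits_per_layer)

def mixed_precision(bits, protected=2):
    """Quantize every layer to `bits` except the first and last `protected` layers."""
    middle = LAYERS - 2 * protected
    return [FP16_BITS] * protected + [bits] * middle + [FP16_BITS] * protected

fp16 = kv_bytes_per_token([FP16_BITS] * LAYERS)
four_bit_nc = kv_bytes_per_token(mixed_precision(4))
three_bit_all = kv_bytes_per_token([3] * LAYERS)

context = 32 * 1024
print(f"FP16 KV cache at 32K tokens: {fp16 * context / 2**30:.1f} GiB per sequence")
print(f"4-bit, 4 layers kept FP16:   {fp16 / four_bit_nc:.2f}x smaller")
print(f"3-bit everywhere:            {fp16 / three_bit_all:.2f}x smaller")
```

Under these assumptions the FP16 cache is 10 GiB per 32K-token sequence, the 4-bit mixed configuration is about 3.48× smaller, and 3-bit everywhere is about 5.33× smaller.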

TurboQuant: paper claims vs production findings
| Configuration | Memory reduction | Reasoning accuracy at 128K+ | Recommendation |
| --- | --- | --- | --- |
| 3-bit with QJL | ~5× | 15–25 pt drop vs FP16 | Avoid for reasoning workloads |
| 4-bit without QJL (_nc) | ~3–4× | Near full precision | Production default |
| FP8 (baseline) | ~2× | Near full precision | If 2× compression is enough |

The useful math: for a cluster serving 70B+ models at 32K+ context, the annual savings estimate at 4-bit comes to about $267,840 per serving cluster. At ten clusters, that's roughly $2.7M/year from one configuration change and no retraining. The 5× paper number would be better. The 3–4× production number is still real money.

Pros & cons

What's real:

  • 4-bit _nc achieves 3–4× memory reduction at accuracy near full precision. That is a genuine production improvement for high-throughput long-context serving.
  • Data-oblivious compression with no calibration or training requirement. This separates TurboQuant from most quantization methods, which need model-specific calibration datasets and can drift with model updates.
  • The mathematical foundation is rigorous. The information-theoretic bound is a real result, not marketing.
  • Active integrations: llama.cpp work is active, vLLM PR #38479 is merged, and an AMD ROCm variant exists. The ecosystem is building around 4-bit _nc configurations.

What deserves a side-eye:

  • Google did not publish a comprehensive accuracy study with failure conditions alongside the ICLR paper. Only first-party benchmarks on shorter-context non-reasoning tasks. Red Hat did the work the paper should have included.
  • The headline claim — near-zero accuracy loss at 3-bit — is technically true on the paper's specific test conditions, which happen to be the conditions where the failure mode doesn't surface. This is a common ML paper failure pattern, not unique to this group.
  • The QJL residual correction, the clever part of the method, is also the part that breaks at low bit widths on reasoning tasks. That's an irony worth naming.
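For readers who want to see what a 1-bit residual correction computes, here is a minimal sketch of a QJL-style inner-product estimator in plain Python. It is an illustration, not the paper's code: the function names, the dimensions, and the random test vectors are assumptions. Only the sign of each random projection of the residual is stored, plus the residual's norm, and the query is projected at full precision.

```python
import math
import random

def gaussian_projection(rows, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def qjl_encode(residual, proj):
    """Store one sign bit per projection row, plus the residual's norm."""
    signs = [1 if dot(row, residual) >= 0 else -1 for row in proj]
    return signs, math.sqrt(dot(residual, residual))

def qjl_inner_product(query, signs, norm, proj):
    """Unbiased estimate of <query, residual> recovered from the sign bits."""
    total = sum(s * dot(row, query) for s, row in zip(signs, proj))
    return math.sqrt(math.pi / 2) * norm * total / len(proj)

rng = random.Random(1)
dim, rows = 16, 4000  # far more rows than a real bit budget allows, so the demo is tight
residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
proj = gaussian_projection(rows, dim, rng)

signs, norm = qjl_encode(residual, proj)
exact = dot(query, residual)
estimate = qjl_inner_product(query, signs, norm, proj)
print(f"exact {exact:.3f}  estimate {estimate:.3f}")
```

The estimate is unbiased, but its noise scales with the residual's norm and shrinks only as the number of projection rows grows. A budget of one bit per coordinate allows roughly as many rows as dimensions, so a real deployment sees much noisier estimates than this demo.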

What builders need to know

  • Use 4-bit _nc (norm correction, QJL disabled), not 3-bit. The 3-bit configuration drops 15–25 points on reasoning at 128K+ context — per the Red Hat AI / vLLM May 2026 evaluation.
  • Keep the first two and last two attention layers at full FP16 precision. These layers carry disproportionate signal; quantizing them degrades the model more than quantizing layers elsewhere in the stack.
  • Sweet spot: 70B+ models at 32K–128K context length in high-throughput serving. Smaller models and shorter contexts change the calculus — run your own eval.
  • If you only need 2× compression, FP8 is simpler and more battle-tested. TurboQuant is the right tool when >2× is the requirement.
  • Active integrations to track: llama.cpp discussion #20969, vLLM PR #38479. Both are moving. AMD ROCm build exists via Pascal-SAPUI5/llama.cpp-turboquant.
  • Do not trust "near-zero accuracy loss" from paper benchmarks alone. Run the vLLM evaluation suite against your workload. Reasoning tasks and long context are where the failure shows up.
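One way to act on the last point is a simple regression gate that compares per-benchmark scores from your own full-precision and quantized runs. The sketch below is a generic helper, not part of vLLM or any evaluation suite; the function name, the 2-point threshold, and the scores are hypothetical placeholders, not measured results.

```python
def flag_regressions(baseline, candidate, max_drop=2.0):
    """Return {benchmark: drop} for every benchmark that fell by more than max_drop points."""
    flagged = {}
    for name, base_score in baseline.items():
        drop = base_score - candidate[name]
        if drop > max_drop:
            flagged[name] = drop
    return flagged

# Placeholder scores for illustration only; these are NOT measured results.
fp16_scores = {"long_context_retrieval": 80.0, "reasoning": 70.0, "code": 60.0}
quantized_scores = {"long_context_retrieval": 79.5, "reasoning": 50.0, "code": 59.0}

regressions = flag_regressions(fp16_scores, quantized_scores)
print(regressions)
```

Feed it scores from your own workload at your own context lengths, and pick the threshold that matches how much accuracy you can afford to trade for memory.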
