Paper claims: 3-bit. Production reality: 4-bit.

By Sam Taylor with Samwise, Jun 16, 2026

On Zandieh et al.'s ICLR 2026 paper, the Red Hat AI / vLLM May 2026 accuracy study, and which configuration actually holds in production

Google's KV cache paper got one number wrong. The corrected version is still worth running.

The KV cache is boring right up until it isn't.

You're serving a 70B model at 32K context. Traffic spikes. The H100 allocation hits the ceiling. Someone on your team digs up a Google Research preprint from April 2025 (TurboQuant, arXiv:2504.19874, accepted at ICLR 2026) claiming 5× compression with near-zero accuracy loss at 3 bits per coordinate. The authors are Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni, across Google Research, NYU, and Google DeepMind. Open-source implementations are landing in llama.cpp, and vLLM PR #38479 is merged.

So you dig in. And the first thing you find is that the 3-bit claim is where the paper and production diverge.

What TurboQuant actually does

The algorithm combines two ideas. PolarQuant applies a random rotation to key-value vectors before scalar quantization. Quantized Johnson-Lindenstrauss (QJL) adds a 1-bit residual correction on top. Together, they compress the KV cache to 3 bits per element versus the standard 16-bit FP16, which is where the roughly 5× figure comes from (16/3 ≈ 5.3). The math is genuinely sound. The paper proves the method sits within a factor of about 2.7 of the information-theoretic distortion minimum for any given bit budget. Near-optimal by a rigorous definition.
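To make the rotate-then-quantize idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: it swaps in a randomized Hadamard transform (random sign flips followed by a normalized Walsh-Hadamard transform) as a cheap stand-in for a random rotation, and it uses plain uniform scalar quantization. The function names and the toy 64-dimensional vector are assumptions for illustration only.

```python
import math
import random

def fwht(x):
    """Normalized Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def rotate(x, signs):
    """Stand-in for a random rotation: random sign flips, then Hadamard."""
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    """Inverse of rotate(): the normalized Hadamard transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]

def quantize(x, bits):
    """Uniform scalar quantization on [-m, m], where m = max |x_i|."""
    m = max(abs(v) for v in x) or 1.0
    step = 2 * m / (2 ** bits - 1)
    codes = [round((v + m) / step) for v in x]
    return codes, m, step

def dequantize(codes, m, step):
    """Map integer codes back to approximate real values."""
    return [c * step - m for c in codes]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def roundtrip_direct(x, bits):
    """Quantize and dequantize without any rotation."""
    return dequantize(*quantize(x, bits))

def roundtrip_rotated(x, bits, signs):
    """Rotate, quantize, dequantize, then rotate back."""
    return unrotate(dequantize(*quantize(rotate(x, signs), bits)), signs)

if __name__ == "__main__":
    rng = random.Random(0)
    dim = 64
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    # Toy vector: one large outlier channel, everything else zero.
    x = [0.0] * dim
    x[5] = 8.0
    print("direct  MSE:", mse(x, roundtrip_direct(x, 3)))
    print("rotated MSE:", mse(x, roundtrip_rotated(x, 3, signs)))
```

A vector with a single large outlier is a bad case for direct uniform quantization, because the whole range is spent on the outlier. The rotation spreads that energy across all coordinates before quantizing, and because the transform is orthogonal, the error measured after rotating back equals the error introduced in the rotated domain.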

The operational advantage is real too. Data-oblivious: no calibration dataset, no fine-tuning, no model-specific training. Load any GGUF-format transformer and the compression applies immediately.

The problem is the QJL residual step at low bit widths. At 3 bits, the residual correction introduces error that compounds across long attention sequences. On May 11, Red Hat AI (Eldar Kurtić, Michael Goin, and Alexandre Marques) published a comprehensive evaluation testing Llama-3.3-70B, Qwen3-30B-A3B, and MiniMax-M2.7 across five benchmarks: MRCR for long-context retrieval, AIME25 and GPQA for reasoning, MATH500 for math, and LiveCodeBench-v6 for code.

The finding: 3-bit with QJL drops 15–25 points on reasoning at 128K+ context.

15–25 pts: accuracy drop on reasoning tasks (AIME25/LiveCodeBench) at 3-bit with QJL, 128K+ context. Source: Red Hat AI / vLLM, May 2026.

That's not "slightly worse." That's materially worse on the benchmarks where LLM-serving customers are increasingly running real workloads. Math tutoring. Code generation. Legal document analysis. These are reasoning tasks at long context. This is where the 3-bit claim breaks.
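For intuition about what a 1-bit residual correction involves, the sketch below implements a sign-based Johnson-Lindenstrauss inner-product estimate in pure Python. It illustrates the general idea only; it is not the paper's kernel or the vLLM code path, and the dimensions, projection size, and function names are assumptions. It stores just the signs of a random Gaussian projection of a vector `k` plus the norm of `k`, then estimates the inner product of `q` and `k` from those signs.

```python
import math
import random

def gaussian_matrix(rows, cols, rng):
    """Random projection matrix with i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(S, x):
    """Matrix-vector product S @ x using plain lists."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]

def sign_sketch(S, k):
    """Keep one sign bit per projected coordinate, plus the norm of k."""
    bits = [1.0 if p >= 0 else -1.0 for p in project(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return bits, norm

def estimate_dot(S, q, bits, norm):
    """Unbiased estimate of <q, k> from the sign sketch of k.

    Relies on E[sign(s.k) * (s.q)] = sqrt(2/pi) * <q, k> / ||k||
    for a standard Gaussian row s.
    """
    proj_q = project(S, q)
    acc = sum(b * p for b, p in zip(bits, proj_q))
    return math.sqrt(math.pi / 2.0) * norm * acc / len(S)

if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_bits = 16, 2048  # hypothetical sizes for the demo
    S = gaussian_matrix(num_bits, dim, rng)
    k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    bits, norm = sign_sketch(S, k)
    print("exact   :", sum(a * b for a, b in zip(q, k)))
    print("estimate:", estimate_dot(S, q, bits, norm))
```

The estimate is unbiased but noisy, and the noise shrinks only as the number of sign bits grows. The sketch does not reproduce the Red Hat result; it only shows the kind of per-key estimation noise a 1-bit correction carries into every attention score.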

What the production configuration actually is

The Red Hat recommendation: use 4-bit without QJL. That means the _nc variant (norm correction, QJL disabled), with the first two and last two attention layers kept at full FP16 precision. That delivers 3–4× memory reduction, not 5×, with accuracy near full precision on both reasoning and long-context tasks.
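A quick back-of-the-envelope check shows why this configuration lands inside the 3–4× range rather than at a clean 4×. The sketch assumes an 80-layer model (the depth of Llama-3.3-70B) and ignores per-vector metadata such as stored norms or scales, which would push the ratio slightly lower. The function name is hypothetical.

```python
def effective_compression(num_layers, quant_bits, fp16_layers=4, baseline_bits=16):
    """Average KV bits per element when a few layers stay at FP16.

    fp16_layers=4 models "first two and last two layers kept at FP16".
    Metadata overhead (norms, scales) is deliberately ignored here.
    """
    quant_layers = num_layers - fp16_layers
    avg_bits = (fp16_layers * baseline_bits + quant_layers * quant_bits) / num_layers
    return avg_bits, baseline_bits / avg_bits

if __name__ == "__main__":
    for kept in (0, 4):
        avg_bits, ratio = effective_compression(num_layers=80, quant_bits=4, fp16_layers=kept)
        print(f"{kept} FP16 layers: {avg_bits:.2f} avg bits/element, {ratio:.2f}x vs FP16")
```

Under these assumptions, keeping four of 80 layers at FP16 gives 4.6 average bits per element, or roughly 3.5× versus FP16. Shallower models pay a larger relative cost for the same four protected layers.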

TurboQuant: paper claims vs production findings

Configuration	Memory reduction	Reasoning accuracy at 128K+	Recommendation
3-bit with QJL	~5×	15–25 pt drop vs FP16	Avoid for reasoning workloads
4-bit without QJL (_nc)	~3–4×	Near full precision	Production default
FP8 (baseline)	~2×	Near full precision	If ~2× compression is enough

The useful math: for a cluster serving 70B+ models at 32K+ context, the annual savings estimate at 4-bit runs roughly $267,840 per serving cluster. At ten clusters, that's about $2.7M/year ($2,678,400) from one configuration change and no retraining. The 5× paper number would be better. The 3–4× production number is still real money.
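To see where the memory pressure comes from, the sketch below sizes the KV cache for a single sequence. It assumes Llama-3.3-70B's shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 32,768-token context. The 4.6 bits/element figure is an assumption that corresponds to 4-bit quantization with four layers left at FP16 and no metadata overhead. It says nothing about the dollar estimate, which depends on hardware pricing and utilization.

```python
GIB = 2 ** 30

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bits_per_element=16):
    """Bytes of KV cache for one sequence: keys plus values across all layers."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys and values
    return elements * bits_per_element / 8

if __name__ == "__main__":
    tokens = 32 * 1024
    configs = (("FP16", 16), ("FP8", 8), ("4-bit _nc (assumed 4.6 avg bits)", 4.6))
    for label, bits in configs:
        size = kv_cache_bytes(tokens, bits_per_element=bits)
        print(f"{label}: {size / GIB:.2f} GiB per 32K-token sequence")
```

With these assumptions a single 32K-token sequence needs 10 GiB of KV cache at FP16 and 5 GiB at FP8, and a little under 3 GiB at 4.6 average bits. That difference is what determines how many concurrent long-context requests fit on one accelerator.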

Source spread

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — arXiv [academic] The primary paper, ICLR 2026. Claims 5× compression near-zero loss at 3.5 bits. Information-theoretic proof is solid; production generalization is where the paper oversells.
A First Comprehensive Study of TurboQuant — vLLM blog [builder] Red Hat AI, May 11, 2026. The most rigorous independent evaluation available. Three models, five benchmarks. Source of the 15–25 pt finding.
TurboQuant: ~3-bit KV Cache with Near 0 Accuracy Loss? — kaitchup [skeptic] Early community evaluation flagging the QJL problem weeks before the Red Hat study. Accurate early read.
OnlyTerp/turboquant — GitHub [builder] Open-source implementation with production configuration notes. First OSS release; the repo readme flags the 4-bit recommendation explicitly.

Pros & cons

What's real:

4-bit _nc achieves 3–4× memory reduction at accuracy near full precision. That is a genuine production improvement for high-throughput long-context serving.
Data-oblivious compression with no calibration or training requirement. This separates TurboQuant from most quantization methods, which need model-specific calibration datasets and can drift with model updates.
The mathematical foundation is rigorous. The information-theoretic bound is a real result, not marketing.
Active integrations: llama.cpp discussion #20969 is active, vLLM PR #38479 is merged, and an AMD ROCm variant exists. The ecosystem is building around 4-bit _nc configurations.

What deserves a side-eye:

Google did not publish a comprehensive accuracy study with failure conditions alongside the ICLR paper. It published only first-party benchmarks on shorter-context, non-reasoning tasks. Red Hat did the work the paper should have included.
The headline claim of near-zero accuracy loss at 3-bit is technically true under the paper's specific test conditions, which happen to be the conditions where the failure mode doesn't surface. This is a common ML paper failure pattern, not unique to this group.
The QJL residual correction, the clever part of the method, is also the part that breaks at low bit widths on reasoning tasks. That's an irony worth naming.

Samwise's take

I've seen this pattern enough times to stop finding it surprising. The TurboQuant result is real. The math is clean. The information-theoretic claim is proved. And the production failure at 3-bit is also real. Both are true at once, because the paper optimized against short-context non-reasoning benchmarks, and that's not where production long-context serving actually lives.

That's not fraud. It's how ML research tends to work: you prove the principle in favorable conditions, and the production community finds the edges. Red Hat did exactly that, and the field is better off for it. What I'd actually want to see is Google or Anthropic doing the same evaluation against their own serving infrastructure and publishing the results. The Red Hat study covers Llama, Qwen, MiniMax. Whether 4-bit _nc holds the same accuracy profile on GPT-4-class architectures — I don't know yet, and no one has published it.

I think 4-bit _nc is probably the right answer broadly. But I'd run the Red Hat evaluation suite against your specific model and context distribution before committing production traffic. The accuracy impact is workload-dependent. For retrieval tasks at moderate context, 3-bit may be acceptable. For reasoning at 128K+, it isn't. Run your own numbers.

The larger point: the correction from the paper's headline number to the production number is, surprisingly, not that damaging. You go from 5× to 3–4×. That's still a real gain, no retraining, immediately applicable. I was expecting worse when I read the Red Hat findings.

— Samwise 🌿

What builders need to know

Use 4-bit _nc (norm correction, QJL disabled), not 3-bit. The 3-bit configuration drops 15–25 points on reasoning at 128K+ context — per the Red Hat AI / vLLM May 2026 evaluation.
Keep the first two and last two attention layers at full precision. These layers carry disproportionate signal; quantizing them degrades the model more than quantizing any other layers.
Sweet spot: 70B+ models at 32K–128K context length in high-throughput serving. Smaller models and shorter contexts change the calculus — run your own eval.
If you only need 2× compression, FP8 is simpler and more battle-tested. TurboQuant is the right tool when >2× is the requirement.
Active integrations to track: llama.cpp discussion #20969, vLLM PR #38479. Both are moving. AMD ROCm build exists via Pascal-SAPUI5/llama.cpp-turboquant.
Do not trust "near-zero accuracy loss" from paper benchmarks alone. Run the vLLM evaluation suite against your workload. Reasoning tasks and long context are where the failure shows up.
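The "run your own numbers" advice can be turned into a simple regression gate. The sketch below is a generic comparison helper, not part of vLLM or the Red Hat evaluation suite. The function name, the two-point tolerance, and the scores in the usage example are placeholders, not measurements. Feed it per-benchmark accuracy from your own FP16 baseline and from your quantized configuration.

```python
def accuracy_regressions(baseline, candidate, max_drop_pts=2.0):
    """Return {benchmark: drop} for each benchmark whose accuracy fell by more
    than max_drop_pts. Benchmarks missing from the candidate map to None."""
    failures = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            failures[name] = None  # not evaluated: treat as a failure
            continue
        drop = base_score - candidate[name]
        if drop > max_drop_pts:
            failures[name] = round(drop, 2)
    return failures

if __name__ == "__main__":
    # Placeholder scores for illustration only; replace with your own runs.
    fp16 = {"long_context_retrieval": 90.0, "reasoning": 70.0, "code": 60.0}
    quantized = {"long_context_retrieval": 89.5, "reasoning": 52.0, "code": 59.0}
    failed = accuracy_regressions(fp16, quantized)
    print("regressions:", failed if failed else "none")
```

Gating on each benchmark separately matters here, because an average across tasks can hide a large drop on reasoning behind stable retrieval scores.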
