TurboQuant (OSS)

Drop-in KV cache compression: 4–7x memory savings, zero accuracy loss

TurboQuant is the first open-source implementation of Google's ICLR 2026 KV cache compression algorithm, released as a pip-installable library (pip install turbokv). It compresses the key-value cache during LLM inference with three techniques: a random rotation (via QR decomposition) that normalizes the data distribution, optimal scalar quantization with Lloyd-Max codebooks for 4-bit encoding, and bit packing that stores each vector in 66 bytes instead of 256. The result is 4–7x memory savings at inference time with bit-identical prefill logits across tested models (7B–70B parameters).

What makes it deployable rather than just a research reproduction is that it works as a drop-in replacement for the HuggingFace Transformers cache: no retraining, no calibration data required, and it generalizes across Llama, Qwen, Gemma, and Phi architectures. At 32K-token contexts, the library recovers up to 5.7GB of VRAM, and needle-in-a-haystack tests show 100% recall, suggesting the quantization isn't eroding long-context retrieval quality.

The original Google paper has been implemented in multiple repos, but this one stands out for its engineering rigor: the author discovered that smaller Qwen models have outlier attention layers requiring special handling, a refinement not in the original paper. For practitioners running large models on constrained hardware or serving long-context workloads, this is immediately actionable. The key caveat: it's a one-person repo with 1 star, so production adoption requires independent validation.
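The three steps above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not TurboQuant's actual code or API: the rotation and nibble packing follow the description directly, while the 4-bit quantizer shown here is uniform rather than the Lloyd-Max codebook the paper specifies, for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(dim: int) -> np.ndarray:
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix;
    # the sign fix makes the distribution uniform over rotations.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def quantize_4bit(x: np.ndarray):
    # Uniform 4-bit scalar quantization (stand-in for Lloyd-Max).
    scale = np.abs(x).max() / 7.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8) + 8  # 0..15
    return codes.astype(np.uint8), scale

def pack_nibbles(codes: np.ndarray) -> bytes:
    # Two 4-bit codes per byte: a 128-dim vector packs into 64 bytes,
    # plus a 2-byte fp16 scale = 66 bytes (vs 256 bytes raw fp16).
    hi, lo = codes[0::2], codes[1::2]
    return ((hi << 4) | lo).astype(np.uint8).tobytes()

dim = 128
v = rng.standard_normal(dim).astype(np.float32)
R = random_rotation(dim)
codes, scale = quantize_4bit(R @ v)
packed = pack_nibbles(codes) + np.float16(scale).tobytes()
print(len(packed))  # 66 bytes, down from dim * 2 = 256 at fp16
```

The 66-byte figure in the summary falls out of the arithmetic: 128 dims x 4 bits = 64 bytes of codes, plus a 2-byte half-precision scale, versus 128 x 2 = 256 bytes for the original fp16 vector, a 3.9x reduction per vector before any further savings.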

Panel Reviews

Ship

Drop-in HuggingFace cache replacement with no retraining and verified zero accuracy loss on multiple architectures is exactly what inference optimization should look like. The pip install story makes it trivially testable.

Skip

1 star, one contributor, zero forks — it's too early to trust this in production. The "bit-identical logits" claim needs independent replication. Valuable reference implementation, not yet a production library.

Skip

KV cache compression is one of the most impactful levers for long-context inference at scale. Getting 4–7x memory reduction without accuracy loss from a pip-installable library changes the economics of running large models locally.

Skip

This is purely for engineers running their own inference. Not relevant to my workflow, but the VRAM savings will eventually mean better local model availability for tools I do use.
