Selective KV Quantization | Anushri Suresh

Don’t Drop It, Compress It: Selective KV Quantization

Modern large language models (LLMs) rely on a Key–Value (KV) cache to store intermediate representations for long‐context generation, but this cache can consume 90%+ of GPU memory as sequence length grows. Existing solutions either drop old tokens (sliding window, attention sinks) or uniformly compress all tokens, often trading off accuracy for memory.

Selective KV Quantization bridges this gap by:

Preserving “sink” (initial prompt) and a sliding “window” of recent tokens in full precision (FP16)
Quantizing the older tokens that fall between the sink and the window to int8
Dequantizing on‐the‐fly only when needed for attention
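The three steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the helper names (`quantize_int8`, `compress_cache`, `read_entry`), the symmetric per-token absmax scaling, and the list-based storage are all assumptions made for the example. A real implementation would operate on GPU tensors.

```python
def quantize_int8(vec):
    """Symmetric int8 quantization of one vector: returns (codes, scale).

    Assumption: one scale per token vector, chosen from its largest magnitude.
    """
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def compress_cache(entries, num_sink, window):
    """Keep the first `num_sink` and last `window` entries in full precision
    and quantize everything in between."""
    n = len(entries)
    cache = []
    for pos, vec in enumerate(entries):
        keep_full = pos < num_sink or pos >= n - window
        if keep_full:
            cache.append(("fp", list(vec)))
        else:
            cache.append(("int8", quantize_int8(vec)))
    return cache


def read_entry(item):
    """Return a full-precision view of a cache entry, dequantizing on demand."""
    kind, payload = item
    if kind == "fp":
        return payload
    codes, scale = payload
    return dequantize_int8(codes, scale)


# Toy cache: 8 token positions, 2 values per position.
entries = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(8)]
cache = compress_cache(entries, num_sink=2, window=3)
print([kind for kind, _ in cache])
```

With `num_sink=2` and `window=3`, positions 0-1 and 5-7 stay in full precision and positions 2-4 are stored as int8 codes plus one scale each. Rounding to the nearest code bounds the per-value reconstruction error by half the scale.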

This hybrid strategy cuts KV cache memory by up to roughly 2× with no increase in perplexity (5.56, identical to the full cache) and a drop in ROUGE-L from 0.2073 to 0.1709 (about 18% relative).
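The "up to 2×" figure follows from the storage widths: FP16 uses 2 bytes per value and int8 uses 1, so the saving approaches 2× only as the quantized middle section dominates the sequence. A rough back-of-the-envelope sketch, assuming hypothetical sink and window sizes and ignoring the small overhead of storing quantization scales:

```python
def cache_bytes(seq_len, num_sink, window, fp_bytes=2, q_bytes=1):
    """Relative bytes per cached value position under the hybrid scheme."""
    full = min(seq_len, num_sink + window)   # positions kept in FP16
    quantized = seq_len - full               # positions stored as int8
    return full * fp_bytes + quantized * q_bytes


def memory_reduction(seq_len, num_sink, window):
    """Ratio of full-FP16 cache size to hybrid cache size."""
    baseline = seq_len * 2                   # everything in FP16
    return baseline / cache_bytes(seq_len, num_sink, window)


# Assumed sizes for illustration only: 4 sink tokens, 252-token window.
for seq_len in (256, 1024, 4096, 32768):
    print(seq_len, round(memory_reduction(seq_len, num_sink=4, window=252), 3))
```

With these assumed sizes, a 256-token sequence sees no saving at all (everything is still in the sink or window), while a 4096-token sequence reaches about 1.88×. The ratio stays strictly below 2× for any finite sequence length.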

Key Results

Caching Strategy	Perplexity	ROUGE-L	Memory Reduction
Full Cache	5.56	0.2073	Baseline
Compressed Cache	132.08	0.1030	~4×
Quantized Cache	5.56	0.1709	~2×

Technical Highlights

Memory Efficiency: Targets the KV cache, which can account for over 90% of total GPU memory usage at long sequence lengths
Accuracy Preservation: Maintains full-cache perplexity while achieving significant memory savings
Selective Strategy: Keeps the most important tokens (sink + sliding window) in full precision and quantizes only the rest
On-the-fly Processing: Efficient dequantization only when needed for attention computation
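To illustrate the on-the-fly processing, the sketch below computes single-query attention over a cache that mixes full-precision and quantized entries, expanding each int8 entry only at the moment it is read. The data layout (a plain list for full-precision entries, a `(codes, scale)` tuple for quantized ones) and the function names are assumptions for this example, not the project's actual interface.

```python
import math


def load(entry):
    """Return full-precision values, dequantizing (codes, scale) tuples on demand."""
    if isinstance(entry, tuple):
        codes, scale = entry
        return [c * scale for c in codes]
    return entry


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a mixed-precision cache."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, load(key))) / math.sqrt(d)
        for key in keys
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(load(values[0]))
    for w, value in zip(weights, values):
        for i, v in enumerate(load(value)):
            out[i] += (w / total) * v
    return out


query = [1.0, 0.0]
# The second key/value pair is stored as int8 codes with a scale of 0.5.
keys = [[1.0, 0.0], ([0, 2], 0.5)]
values = [[1.0, 2.0], ([6, 8], 0.5)]
print(attend(query, keys, values))
```

Only one dequantized vector is alive at a time here, which is the point of the design: the int8 copy remains the stored form, and the full-precision values exist just long enough to contribute to the attention output.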

Implementation

The project includes comprehensive evaluation scripts, PyTorch profiler integration, and support for various compression strategies. The codebase is built on top of PyTorch and optimized for CUDA-enabled GPUs.

GitHub Repository: anushrisuresh/kv-quantization

Technologies: PyTorch, CUDA, Python, Quantization, Large Language Models