Quantization is a hardware story (and an entropy puzzle)
I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.
Quantization is a hardware story (and an entropy puzzle)
I came in thinking quantization was about saving disk space. It's actually about feeding the bus, on both ends of the hardware spectrum. And underneath the bus story there's an information-theoretic puzzle that took me longer than I'd like to admit to make peace with.
The post walks the hardware framing first, then the size-vs-quality Pareto, then the entropy puzzle that explains why we get away with 4 bits at all.
Hardware on both ends
Edge and server look different on a spec sheet and identical on a roofline. Both are memory-bound on dense decode, and both fix the bit width before they fix anything else.
- Snapdragon X Elite — 135 GB/s LPDDR5x.
- Jetson AGX Orin (64 GB) — 204.8 GB/s LPDDR5.
- Apple M4 Max (40-core GPU bin) — 546 GB/s unified memory.
- Apple M3 Ultra — 819 GB/s.
- NVIDIA H100 SXM — 3.35 TB/s HBM3.
- NVIDIA H200 — 4.8 TB/s HBM3e.
Three orders of magnitude between the slowest and fastest entries, and yet the same arithmetic decides what runs. A 70B model at FP16 reads 140 GB per token. The Snapdragon caps you at one token per second before any compute has happened; the H200 caps you at thirty-something. Drop to int4 and bytes shrink 4×, ceiling rises 4×. Nothing about the math knows which side of the bus you're on.
Memory-bound tokens/sec across devices
7, 13, 70, 180 are common points
KV cache grows linearly; at 128K it rivals weight bytes
The KV-cache series is why a server-side picture can't stop at "weights are 4× smaller now". On long contexts the cache reads dominate. A 70B model at int4 with 128K of context reads roughly as many bytes for the cache as for the weights, every step. The KV-quantization line of work (KIVI, KVQuant) is the same bandwidth move applied to a different tensor.
The size + bandwidth argument
The memory math is short. Bytes per token is , time over a bus of bandwidth is , and that is — to a better approximation than I expected — the latency. Arithmetic isn't the bottleneck. Quantization isn't making matmul faster. It's shrinking the operands so the bus can keep up.
Technique tour, compressed
RTN → GPTQ (2022) → SmoothQuant (2022) → AWQ (2023) → 2024+ (QuIP#, AQLM, HQQ, SpinQuant). One paragraph and one plot per paper. Each one fixes a specific failure mode of the previous; together they're the conceptual backbone the field shipped.
Round-to-nearest is the baseline. Pick a symmetric range , divide it into levels, snap each weight to the nearest one. The error is a sawtooth of period inside the range, linear in outside. Per-channel scales (one per output channel) are the cheapest fix and almost always the right starting move.
RTN error envelope
GPTQ takes the coupling between weights seriously. Quantize left to right; when you snap to the grid you incur a residual, and instead of throwing it away you push it onto the unquantized weights in proportions given by the inverse Hessian column. The cumulative output error telescopes instead of random-walking. Headline result at the time: 3–4 bit OPT-175B and BLOOM-176B with small perplexity hits.
Cumulative output error: RTN vs. GPTQ
SmoothQuant patches what RTN does to activations. A few residual-stream channels carry magnitudes 10–100× the rest; per-tensor scales waste their grid on those spikes. Migrate the magnitude across the matmul: . Same product, different dynamic ranges. Picking is a one-parameter search.
Effective dynamic range vs. α
AWQ asks which channels actually matter and finds the answer is "very few". Salience tracks activation magnitude, not weight magnitude; the top ~1% of channels carry a disproportionate share of the output. The contribution is the trick for protecting them without dropping into mixed-precision matmul: a per-channel scale search that gives the salient channels a finer rounding step. Clean kernels, no precision boundaries inside the GEMM.
Cumulative salience captured vs. protected fraction
The four moves are the same idea in different coordinates: pick grid points, push error somewhere it matters less, rescale the axis to fit the data, protect the few channels where a flip is catastrophic.
The Pareto curve
The headline picture is quality vs. bits per weight. Two curves: naive RTN and the best-known stack. The shape is what matters.
Quality vs. bits per weight
How far the GPTQ+SmoothQuant+AWQ stack shifts the curve left
Two beliefs to walk away with:
- Between 4 and 8 bits the curve is almost flat. That's why nobody ships 6-bit kernels. There's nothing meaningful to win there.
- Below 4 bits it falls off a cliff. That's where the 2024+ wave (vector quantization, lattice codes, learned codebooks) is actually fighting.
Most of the post-2022 literature is mining a small region of this frontier. The framing doesn't change much; the techniques get fancier; the elbow inches left.
The entropy puzzle
I think the entropy story is the prettiest thing here.
A trained language model's weights look almost-random to gzip. The marginal distribution is close to a zero-mean Gaussian with mild heavy tails — close to maximum entropy for its variance, which is roughly what you'd expect from a network trained near a flat region of the loss landscape. General-purpose compressors look at a packed FP32 weight tensor and shrug: the ratio sits in the high-90s, depending on the layer and the quantizer.
Yet 4-bit quantization gives you 4× compression with sub-perplexity-point quality loss. How?
The trick isn't in the entropy of the weights. It's that not all bits in an FP32 weight matter equally for next-token prediction. The mantissa's low-order bits are essentially noise to the model. They don't change the gradient. Quantization is a lossy transform that throws away exactly those bits, and the reason it works at 4× is that loss-relevant precision is much lower than IEEE-754 precision. Task-aware compression beats general-purpose compression because the task is the side-channel that tells you which bits to keep.
A small paradox falls out. Random init weights are uniform-ish; partially trained ones are still close to init. The trained model's weight distribution has higher entropy in the information-theoretic sense, because training spreads parameters out into something Gaussian. "Well-trained" and "looks random to gzip" are nearly the same statement. Training drives entropy up, task-relevance up faster, and quantization exploits the gap.
The recent line of work — EntroLLM (2025) and Rate-Constrained Quantization and Entropy Coding (2025) — uses entropy coding on top of quantization, not instead of it. Quantize first to reduce the alphabet size, then Huffman-code the integer tensor. That's where lossless compression starts working, because the discrete codes are no longer maximum-entropy.
The "task-aware bits beat IEEE-754 bits" framing is my own hypothesis, not a published result. I find it the cleanest way to reconcile the ratios you see, but I haven't proved it. The two papers above motivate the entropy-coding side; the gap to the gzip-on-FP32 ratio is where I'm extrapolating.
Compression ratio: noise vs. trained weights vs. quantized + coded
1 = optimal Huffman; 0 = stored at log2(L) bits
Modelled compression ratio across precisions and algorithms
The point the bars make: gzip on FP32 weights barely buys you anything (you're paying for irreducible mantissa noise), but the same compressor on 4-bit codes is a real 1.3–1.5× on top of the 8× you already got from precision. That second factor is the whole reason entropy-coded weight formats are starting to ship.
What's actually moving
Below 4 bits is the interesting regime. The four-paper backbone flattens the curve down to about 4 bits per weight; the 2024+ wave (QuIP#, AQLM, lattice codes, vector quantization, entropy-coded codebooks) is pushing into 2–3 bits, where "smarter scalar grid" stops being enough. You start needing structured codebooks, joint coding across weights, or task-aware allocation of the bits you have left.
The hardware framing and the entropy framing meet here. Below 4 bits is where the bandwidth gain per bit is the largest you'll get, and it's also where the entropy of the discrete codes is finally low enough for general-purpose compressors to do real work. I think these are two views of the same fact: most of an FP32 weight is noise to the next-token loss, and below 4 bits is where you're finally compressing the part that isn't.
Comments
Loading comments…