Quantization is a hardware story (and an entropy puzzle)

For a long time I filed quantization under "saves disk space". Then I did the memory math and the disk turned out to be beside the point. Quantization is about feeding the bus, on both ends of the hardware spectrum. And underneath the bus story there's an information-theoretic puzzle that took me longer than I'd like to admit to make peace with.

Hardware on both ends

Edge and server look different on a spec sheet and identical on a roofline. Both are memory-bound on dense decode, and both fix the bit width before they fix anything else.

Snapdragon X Elite: 135 GB/s LPDDR5x.
Jetson AGX Orin (64 GB): 204.8 GB/s LPDDR5.
Apple M4 Max (40-core GPU bin): 546 GB/s unified memory.
Apple M3 Ultra: 819 GB/s.
NVIDIA H100 SXM: 3.35 TB/s HBM3.
NVIDIA H200: 4.8 TB/s HBM3e.

The spread between the slowest and fastest entries is enormous, and yet the same arithmetic decides what runs. A 70B model at FP16 reads 140 GB per token. The Snapdragon caps you at one token per second before any compute has happened; the H200 caps you at thirty-something. Drop to int4 and bytes shrink 4×, ceiling rises 4×.
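
The ceiling arithmetic fits in a few lines. This is a minimal sketch, assuming every weight byte crosses the bus exactly once per decoded token and nothing else is read; the function names are illustrative, and the bandwidth figures are the ones listed above.

```python
# Peak memory bandwidth in GB/s, copied from the device list above.
BANDWIDTH_GBPS = {
    "Snapdragon X Elite": 135.0,
    "Jetson AGX Orin (64 GB)": 204.8,
    "Apple M4 Max": 546.0,
    "Apple M3 Ultra": 819.0,
    "NVIDIA H100 SXM": 3350.0,
    "NVIDIA H200": 4800.0,
}


def weight_gb(params_b: float, bits: float) -> float:
    """GB of weights read per decoded token (params_b is in billions)."""
    return params_b * bits / 8.0


def tokens_per_sec_ceiling(params_b: float, bits: float, bandwidth_gbps: float) -> float:
    """Upper bound on decode speed when weight reads are the only traffic."""
    return bandwidth_gbps / weight_gb(params_b, bits)


# Print the FP16 and INT4 ceilings for a 70B model on each device.
for name, bw in BANDWIDTH_GBPS.items():
    fp16 = tokens_per_sec_ceiling(70, 16, bw)
    int4 = tokens_per_sec_ceiling(70, 4, bw)
    print(f"{name:26s} FP16 {fp16:7.2f} tok/s   INT4 {int4:7.2f} tok/s")
```

The INT4 column is always four times the FP16 column, which is the whole argument: the device only sets the constant.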

Figure (interactive): Memory-bound tokens/sec across devices. Series: FP16 ceiling, INT4 ceiling, and your chosen bit width, per device. Controls: model size in billions of parameters (7, 13, 70, and 180 are common points), bits per weight, and context length in K tokens. Slide the context length up and watch the KV cache close the gap that quantization just opened: the cache grows linearly, and at 128K it rivals the weight bytes. At the defaults (70B params at 4-bit, 8K context) the model reads 37.8 GB per token with a 7% KV share, which caps the Snapdragon X Elite at 3.6 tok/s and the H200 at 127 tok/s.

The KV-cache series is why a server-side picture can't stop at "weights are 4× smaller now". On long contexts the cache reads dominate. A 70B model at int4 with 128K of context reads roughly as many bytes for the cache as for the weights, every step. The KV-quantization line of work (KIVI, KVQuant) is the same bandwidth move applied to a different tensor.
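
To see where the cache overtakes the weights, here is a rough sketch. The model shape is an assumption that is not stated in this post: a Llama-2-70B-like layout (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with an FP16 cache. Other architectures will move the crossover.

```python
def kv_bytes_per_step(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                      head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of cached K and V read for one decode step.

    The default shape is an assumed Llama-2-70B-like layout with an FP16 cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return context_tokens * per_token


def weight_bytes(params: float, bits: float) -> float:
    """Bytes of weights read for one decode step."""
    return params * bits / 8.0


weights = weight_bytes(70e9, 4)  # 35 GB at int4
for ctx_k in (8, 32, 128):
    kv = kv_bytes_per_step(ctx_k * 1024)
    share = kv / (kv + weights)
    print(f"{ctx_k:4d}K context: KV {kv / 1e9:6.2f} GB, weights {weights / 1e9:.1f} GB, "
          f"KV share {share:.0%}")
```

Under these assumptions the cache is a rounding error at 8K and the larger of the two reads at 128K, which is why quantizing the KV tensors pays off on long contexts.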

The size + bandwidth argument

The memory math is short. With $P$ parameters stored at $b$ bits each, the bytes read per token are $P \cdot b / 8$, the time to move them over a bus of bandwidth $B$ is $P b / (8 B)$, and that (to a better approximation than I expected) is the latency. Arithmetic isn't the bottleneck. Quantization isn't making matmul faster. It's shrinking the operands so the bus can keep up.

Technique tour, compressed

RTN → GPTQ (2022) → SmoothQuant (2022) → AWQ (2023) → 2024+ (QuIP#, AQLM, HQQ, SpinQuant). One paragraph and one plot per paper. Each fixes a specific failure of the one before.

Round-to-nearest is the baseline. Pick a symmetric range $[-R, R]$, lay a $b$-bit grid over it, and snap each weight to the nearest grid point. With step $s = R/(2^{b-1}-1)$ the symmetric grid uses $2^b - 1$ of the $2^b$ available codes, so zero is exactly representable. The error is a sawtooth of period $s$ inside the range and linear in $|w|$ outside it. Per-channel scales (one $R$ per output channel) are the cheapest fix and almost always the right starting move.
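
A minimal sketch of symmetric RTN on scalars, assuming the step $s = R/(2^{b-1}-1)$ from the paragraph above and weights drawn uniformly inside the range. It checks the two regimes: error bounded by $s/2$ inside $[-R, R]$, and error growing as $|w| - R$ outside. The helper names are illustrative.

```python
import math
import random


def step_size(R: float, bits: int) -> float:
    """Grid step for a symmetric b-bit grid over [-R, R]."""
    return R / (2 ** (bits - 1) - 1)


def rtn_quantize(w: float, R: float, bits: int) -> float:
    """Snap w to the nearest grid point, clipping to [-R, R]."""
    qmax = 2 ** (bits - 1) - 1
    s = step_size(R, bits)
    q = max(-qmax, min(qmax, round(w / s)))  # integer code after clipping
    return q * s


random.seed(0)
R, bits = 1.0, 4
s = step_size(R, bits)

# Inside the range: the error is a sawtooth bounded by s/2.
inside = [random.uniform(-R, R) for _ in range(50_000)]
errors = [rtn_quantize(w, R, bits) - w for w in inside]
max_err = max(abs(e) for e in errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"s = {s:.4f}  max|err| = {max_err:.4f}  RMSE = {rmse:.4f}  "
      f"s/sqrt(12) = {s / math.sqrt(12):.4f}")

# Outside the range: the error is |w| - R and keeps growing.
for w in (1.25, 1.5, 2.0):
    print(f"w = {w:.2f}  |err| = {abs(rtn_quantize(w, R, bits) - w):.4f}")
```

The RMSE landing near $s/\sqrt{12}$ is just the standard deviation of a uniform error on $[-s/2, s/2]$; it only holds while nothing is clipped.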

Figure (interactive): RTN error envelope. The error is a sawtooth bounded by ±s/2 inside [-R, R]; outside the range it grows linearly and doesn't stop. The whole RTN tradeoff lives at the boundary between the two regimes. Controls: bits, weight magnitude, and clip R / wMax (shrink it and watch the tails take over). At the defaults (4 bits, weight magnitude 1.00, clip at 100%): 16 levels, step size s = 0.1429, max |error| = 0.0714, RMSE = 0.0412.

GPTQ takes the coupling between weights seriously. Quantize left to right; when you snap $w_i$ to the grid you incur a residual, and instead of throwing it away you push it onto the unquantized weights in proportions given by the inverse Hessian column. The cumulative output error telescopes instead of random-walking. Headline result at the time: 3–4 bit OPT-175B and BLOOM-176B with small perplexity hits.
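
The telescoping is easiest to see in a one-dimensional toy. This is not GPTQ proper: real GPTQ spreads each residual over all remaining weights using the inverse Hessian of the layer inputs. The sketch assumes every weight sees the same input, so the entire residual is carried to the next weight, with no clipping. The `rho` knob is a hypothetical feedback strength (0 gives plain RTN, 1 gives full feedback).

```python
import random


def cumulative_error(weights, s, rho):
    """Running sum of (quantized - original) under error feedback of strength rho."""
    carry = 0.0      # residual left over from the previous weight
    running = 0.0    # cumulative output error so far
    trace = []
    for w in weights:
        target = w + rho * carry      # push the previous residual onto this weight
        q = s * round(target / s)     # snap to the grid (no clipping in this toy)
        carry = target - q            # what this snap failed to represent
        running += q - w
        trace.append(running)
    return trace


random.seed(0)
s = 1.0 / 3.0  # roughly a 3-bit symmetric grid over [-1, 1]
weights = [random.gauss(0.0, 1.0) for _ in range(120)]

rtn_trace = cumulative_error(weights, s, rho=0.0)
fb_trace = cumulative_error(weights, s, rho=1.0)

print(f"RTN      max |cumulative error| = {max(abs(c) for c in rtn_trace):.4f}")
print(f"feedback max |cumulative error| = {max(abs(c) for c in fb_trace):.4f}")
print(f"half a grid step                = {s / 2:.4f}")
```

With `rho = 1` each term `q - w` equals the previous carry minus the new one, so the sum collapses to minus the last carry and can never exceed half a grid step. With `rho = 0` nothing cancels and the sum drifts.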

Figure (interactive): Cumulative output error, RTN vs. GPTQ. RTN's residuals random-walk upward (√n growth); GPTQ's line flattens near one grid step and stays there. The flat line is the telescoping. Controls: bits and feedback ρ (drag it to 0 and GPTQ becomes RTN). At the defaults (3 bits, ρ = 0.70, 120 weights swept): RTN RMS |Σv| = 0.9992, GPTQ RMS |Σv| = 0.3268 (−67.3%).

SmoothQuant patches what RTN does to activations. A few residual-stream channels carry magnitudes 10–100× the rest; per-tensor scales waste their grid on those spikes. Migrate the magnitude across the matmul: $X W = (X \, \mathrm{diag}(m)^{-1}) (\mathrm{diag}(m) \, W)$. Same product, different dynamic ranges. Picking $m$ is a one-parameter search: the paper sets $m_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ per input channel $j$ and sweeps $\alpha$.
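
A small sketch of the migration on toy matrices, using the per-channel rule from the SmoothQuant paper with a migration strength `alpha`. The matrices are made up for illustration (one activation channel spiked to about 30× the others), and the helpers are plain Python, not any library's API.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth_scales(X, W, alpha=0.5):
    """Per-channel m_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales


def migrate(X, W, m):
    """Return (X diag(m)^-1, diag(m) W); their product equals X W."""
    Xs = [[x / m[j] for j, x in enumerate(row)] for row in X]
    Ws = [[m[j] * v for v in W[j]] for j in range(len(W))]
    return Xs, Ws


def abs_max(M):
    """Largest magnitude in a matrix, i.e. what a per-tensor scale must cover."""
    return max(abs(v) for row in M for v in row)


# Toy data: tokens x channels activations with channel 1 spiked, channels x outputs weights.
X = [[0.5, -30.0, 1.0], [-1.0, 24.0, 0.8], [0.7, -27.0, -1.2]]
W = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.25]]

m = smooth_scales(X, W, alpha=0.5)
Xs, Ws = migrate(X, W, m)

print(f"activation range: {abs_max(X):.2f} -> {abs_max(Xs):.2f}")
print(f"weight range:     {abs_max(W):.2f} -> {abs_max(Ws):.2f}")
```

At `alpha = 0.5` each channel's activation and weight maxima both become the geometric mean of the originals, so the two ranges meet.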

Figure (interactive): Effective dynamic range vs. α, for activations and weights. The activation range falls fast as α grows; the weight range rises gently to meet it. The crossover is the zone where neither side wastes grid levels on the other's outliers. Controls: α (the paper default is 0.5), outlier spike as a multiple of the median, and number of spiked channels. At the defaults (α = 0.50, ×30 spike, 3 spiked channels) the activation range and the weight range both come out at 6.3.

AWQ asks which channels actually matter and finds the answer is "very few". Salience tracks activation magnitude, not weight magnitude; the top ~1% of channels carry a disproportionate share of the output. The contribution is the trick for protecting them without dropping into mixed-precision matmul: a per-channel scale search that gives the salient channels a finer rounding step. Clean kernels, no precision boundaries inside the GEMM.

Figure (interactive): Cumulative salience captured vs. protected fraction. A Pareto-ish CDF: protect the top 1% of channels and you've already caught ~40% of total salience. The head is where the mass lives; the slider mostly confirms it. Controls: protected fraction (top α%) and Pareto exponent β. At the defaults (1.0% protected, β = 1.10): captured salience 38.9%, and reaching 90% takes 41.3% of channels.

The four moves are the same idea in different coordinates: pick grid points, push error somewhere it matters less, rescale the axis to fit the data, protect the few channels where a flip is catastrophic.

The Pareto curve

The headline picture is quality vs. bits per weight. Two curves: naive RTN and the best-known stack. The shape is what matters.

Figure (interactive): Quality vs. bits per weight, naive RTN vs. the best-known stack. Toy model: ppl(b) = ppl_fp16 · (1 + α · 2^(-βb)). Naive RTN elbows around 6 bits; the best-known stack pushes the elbow to ~4. The shaded region below the stack elbow is where the 2024+ techniques are aimed. Controls: α (difficulty), β (bit elasticity), and the technique-stack offset in bits, meaning how far the GPTQ+SmoothQuant+AWQ stack shifts the curve left. Defaults: fp16 ppl 5.60, α = 6.00, β = 1.10, stack offset 1.6 bits.

Two beliefs to walk away with:

Between 4 and 8 bits the best-known stack's curve is almost flat. That's why 6-bit kernels get so little attention: there's little left to win there.
Below 4 bits it falls off a cliff. That's where the 2024+ wave (vector quantization, lattice codes, learned codebooks) is aimed.
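
The toy curve behind the plot, ppl(b) = ppl_fp16 · (1 + α · 2^(-βb)), is easy to tabulate. In this sketch the parameter values are the plot's defaults, and modelling the technique stack as "evaluate the same curve at b + offset" is an assumption about what "shifts the curve left" means. None of these numbers are measurements.

```python
def toy_ppl(bits: float, ppl_fp16: float = 5.6, alpha: float = 6.0, beta: float = 1.1) -> float:
    """Toy perplexity for naive RTN at a given bit width."""
    return ppl_fp16 * (1.0 + alpha * 2.0 ** (-beta * bits))


def toy_ppl_stack(bits: float, offset: float = 1.6, **kwargs) -> float:
    """Assumed model of the technique stack: the RTN curve shifted left by `offset` bits."""
    return toy_ppl(bits + offset, **kwargs)


print("bits   naive RTN   stack")
for b in (2, 3, 4, 6, 8):
    print(f"{b:4d}   {toy_ppl(b):9.2f}   {toy_ppl_stack(b):5.2f}")
```

Both columns converge on the FP16 value as bits grow and blow up as bits shrink. The stack column just reaches any given perplexity 1.6 bits earlier.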

Most of the post-2022 literature is mining a small region of this frontier. The framing doesn't change much; the techniques get fancier; the elbow inches left.

The entropy puzzle

I think the entropy story is the prettiest thing here.

A trained language model's weights look almost-random to gzip. The marginal distribution is close to a zero-mean Gaussian with mild heavy tails (close to maximum entropy for its variance, which is roughly what you'd expect from a network trained near a flat region of the loss landscape). General-purpose compressors barely dent a packed FP32 weight tensor: the compressed size stays above 90% of the original, depending on the layer and the compressor.

Yet 4-bit quantization gives you 8× compression over FP32 (4× over FP16) with less than a perplexity point of quality loss. How?

The trick isn't in the entropy of the weights. It's that not all bits in an FP32 weight matter equally for next-token prediction. The mantissa's low-order bits are essentially noise to the model: perturbing them barely moves the loss. Quantization is a lossy transform that throws away exactly those bits, and the reason it survives that much compression is that loss-relevant precision is much lower than IEEE-754 precision. Task-aware compression beats general-purpose compression because the task is the side-channel that tells you which bits to keep.

A small paradox falls out. Random init weights are uniform-ish; partially trained ones are still close to init. The trained model's weight distribution has higher entropy in the information-theoretic sense, because training spreads parameters out into something Gaussian. "Well-trained" and "looks random to gzip" are nearly the same statement. Training drives entropy up, task-relevance up faster, and quantization exploits the gap.

Recent work (EntroLLM (2025), Rate-Constrained Quantization and Entropy Coding (2025)) puts entropy coding on top of quantization, not instead of it. Quantize first to reduce the alphabet size, then Huffman-code the integer tensor. That's where lossless compression starts working, because the discrete codes are no longer maximum-entropy.
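
The "quantize, then entropy-code" order can be poked at with the standard library. This sketch uses synthetic Gaussian weights as a stand-in for a checkpoint, so it is not the real experiment. It also uses zlib as a convenient general-purpose coder (DEFLATE includes a Huffman stage), which is not the coding scheme of either paper above.

```python
import math
import random
import struct
import zlib
from collections import Counter


def quantize_codes(weights, bits):
    """Symmetric RTN to integer codes in [-qmax, qmax], one scale for the tensor."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / s))) for w in weights]


def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def pack_nibbles(codes):
    """Pack 4-bit codes (shifted to 0..14) two per byte."""
    u = [c + 7 for c in codes]
    if len(u) % 2:
        u.append(0)
    return bytes((u[i] << 4) | u[i + 1] for i in range(0, len(u), 2))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20_000)]  # synthetic stand-in

fp32_raw = struct.pack(f"<{len(weights)}f", *weights)
codes = quantize_codes(weights, 4)
int4_raw = pack_nibbles(codes)

fp32_ratio = len(zlib.compress(fp32_raw, 9)) / len(fp32_raw)
int4_ratio = len(zlib.compress(int4_raw, 9)) / len(int4_raw)
code_entropy = entropy_bits(codes)

print(f"FP32 + zlib: {fp32_ratio:.2f} of raw")
print(f"INT4 + zlib: {int4_ratio:.2f} of raw")
print(f"entropy of 4-bit codes: {code_entropy:.2f} bits per weight (stored at 4)")
```

The code entropy coming in under 4 bits is the slack an entropy coder can claim. How much of it a byte-oriented compressor recovers from packed nibbles is a separate question, and a coder that works on the symbols directly should get closer.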

The "task-aware bits beat IEEE-754 bits" framing is my own hypothesis, not a published result. I find it the cleanest way to reconcile the ratios you see, but I haven't proved it. The two papers above motivate the entropy-coding side; the gap to the gzip-on-FP32 ratio is where I'm extrapolating.

Figure (interactive): Compression ratio for IID Gaussian noise, trained FP32 weights, and quantized + coded weights. The curves are a closed-form H(X)/H_max(X) proxy for what a general-purpose compressor would do on iid-ish data. IID Gaussian noise sits pinned at 1.0 (nothing to compress); trained FP32 weights sit just under it; quantized-then-entropy-coded drops to where 4-bit lives. The gap between the first two and the third is the whole puzzle. Controls: quantizer bits, distribution shape (Gaussian → tail → bimodal), and entropy-coding efficiency (1 = optimal Huffman; 0 = stored at log2(L) bits). At the defaults (4 bits, shape 0.40, efficiency 0.85): noise ratio 1.00, FP32 weights ratio 0.94, 4-bit + coded ratio 0.11, effective compression 8.9×.

Figure (interactive): Modelled compression ratio across precisions and algorithms. One bar per (precision, algorithm) pair, for FP32, FP16, INT8, and INT4 under gzip, zstd, and brotli; the values are closed-form entropy-bound proxies, not measurements. What to see: the compressors barely dent FP32/FP16 (mantissa low-bits look like noise), then suddenly find structure at INT8/INT4. Controls: distribution shape (Gaussian → tail → bimodal) and parameter count in millions. At the defaults (shape 0.40, 7000 M params): FP32 raw 28000 MB, INT4 + brotli 3206 MB (0.11× of FP32), FP32 + zstd 104% of raw.

The point the bars make: gzip on FP32 weights barely buys you anything (you're paying for irreducible mantissa noise), but the same compressor on 4-bit codes is a real 1.3–1.5× on top of the 8× you already got from precision. That second factor is the whole reason entropy-coded weight formats are starting to ship.

What's actually moving

Below 4 bits is the interesting regime. The four-paper backbone flattens the curve down to about 4 bits per weight; the 2024+ wave (QuIP#, AQLM, lattice codes, vector quantization, entropy-coded codebooks) is pushing into 2–3 bits, where "smarter scalar grid" stops being enough. You start needing structured codebooks, joint coding across weights, or task-aware allocation of the bits you have left.

The hardware framing and the entropy framing meet here. Below 4 bits is where the bandwidth gain per bit is the largest you'll get, and it's also where the entropy of the discrete codes is finally low enough for general-purpose compressors to do real work. I think these are two views of the same fact: most of an FP32 weight is noise to the next-token loss, and below 4 bits is where you're finally compressing the part that isn't.

The experiment I owe this post is the real one: gzip over an actual checkpoint at each precision, measured instead of modelled. If you've run it, or you can poke a hole in the task-aware-bits framing, the comments are exactly what I want them for.
