Sparse embeddings: a six-week negative result

I kept thinking the embedding matrix was the easy part of parameter-golf. It's just a lookup table. You quantise it, you compress it, you ship. Then I spent six weeks on it and shipped nothing.

This is the post about that. The headline, honestly stated: each doubling of the vocab buys somewhere between 0.007 and 0.01 BPB on validation, and every way I found to compress the bigger embedding back under the 16 MB artifact cap cost more than that. The locked variant is what it was before any of this work (SP8192 + CaseOps + INT7, 1.06755 BPB at 15.91 MB, 90 KB under the cap). Everything I tried at SP12k, SP16k, SP24k, SP32k landed somewhere between +0.005 and +0.045 BPB worse once compression was honest.

The naive argument is right

Bigger vocab is real. Same model, same recipe, same training budget, just more rows in the embed and a bigger LM head:

vocab	best INT-bits	val BPB	brotli MB
SP8k	INT7	1.06755	15.91
SP12k	INT7	1.06024	18.10
SP16k	INT7	1.06024	18.10
SP24k	INT7	1.05338	20.36
SP32k	INT7	1.05076	22.52

Doubling vocab from 8k to 16k buys 0.0073 BPB. Doubling again to 32k buys another 0.0095. The curve hadn't saturated by SP32k. If I had a free 6 MB lying around, I'd just take the bigger vocab and stop writing.

[Figure: Validation BPB vs compressed embed bytes. Series: dense FP16, dense INT8, dense INT7, dense INT5, dense INT4, sparse FP16 (top-K), freq-weighted bits. Each series sweeps vocab 8k→32k. The vertical line is the 16 MB brotli cap; the horizontal line is the locked SP8k INT7 baseline. Only SP8k stays under the cap at INT7; every bigger vocab blows it. At vocab 8,192 the readout shows dense INT7 at 1.9 MB / 1.0675 BPB (the locked baseline) and sparse-FP16 at 1.1 MB.]

The plot is the whole argument in one frame. The dense INT7 dot sits left of the cap line at SP8k and right of it everywhere else.

The cost

At $d = 512$ and INT7, the raw embed is $V \cdot d \cdot 7 / 8$ bytes, which brotli shrinks to about 0.55 of that. The LM head is tied so it's free, but the rest of the model still has to fit inside the same 16 MB cap. In the table above, going from SP8k to SP16k adds about 2.2 MB compressed at INT7, SP16k to SP32k adds another 4.4 MB, and SP32k ends up about 6.6 MB over SP8k. The cap is 16, and SP8k already sits only 90 KB under it. The math doesn't close.
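A quick way to sanity-check the formula is to evaluate it for a few vocab sizes. This is a back-of-envelope sketch, not the project's size accounting: it assumes the 0.55 figure means compressed = raw × 0.55, takes 1 MB as 10^6 bytes, and the function names are made up for illustration.

```python
# Sketch only: the 0.55 ratio is the rough figure from the text, read here as
# compressed_size = raw_size * 0.55, and 1 MB is taken as 10**6 bytes.
BROTLI_RATIO = 0.55


def raw_embed_bytes(vocab: int, dim: int = 512, bits: int = 7) -> float:
    """Packed size of a dense V x d embedding at `bits` bits per weight."""
    return vocab * dim * bits / 8


def compressed_embed_mb(vocab: int, dim: int = 512, bits: int = 7) -> float:
    """Estimated post-brotli size in MB under the assumed ratio."""
    return raw_embed_bytes(vocab, dim, bits) * BROTLI_RATIO / 1e6


if __name__ == "__main__":
    for vocab in (8192, 16384, 32768):
        print(f"V={vocab:>6}: raw {raw_embed_bytes(vocab) / 1e6:.2f} MB, "
              f"compressed ~{compressed_embed_mb(vocab):.2f} MB")
```

Under these assumptions the embed alone costs roughly 2 MB compressed per 8k rows at INT7, which is in the same range as the SP8k to SP16k step in the vocab table.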

Everything that follows is some way of trying to fit the bigger vocab in fewer bytes. None of them worked.

Approach 1: per-token bit allocation

The first instinct is right. Token frequencies are Zipf-like. The top thousand tokens carry maybe half the probability mass at our scale. Their rows are hit more often, their gradients are healthier, and they probably need higher precision than the long tail, which should tolerate fewer bits. So: top-K rows at INT8, the rest at INT4 or INT5.
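The head/tail split can be sketched in a few lines. This is an illustrative version, not the sweep code: it assumes symmetric per-row quantisation with one scale per row, and the names `quantize_row`, `mixed_bit_embed`, and `raw_payload_bytes` are hypothetical.

```python
def quantize_row(row, bits):
    """Symmetric per-row quantisation: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in row) or 1.0  # avoid a zero scale on all-zero rows
    scale = amax / qmax
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def mixed_bit_embed(embed, freqs, head_rows, head_bits=8, tail_bits=5):
    """Quantise the `head_rows` most frequent rows at head_bits, the rest at tail_bits."""
    order = sorted(range(len(embed)), key=lambda i: freqs[i], reverse=True)
    head = set(order[:head_rows])
    bits_per_row = [head_bits if i in head else tail_bits
                    for i in range(len(embed))]
    quantised = [quantize_row(row, b) for row, b in zip(embed, bits_per_row)]
    return quantised, bits_per_row


def raw_payload_bytes(bits_per_row, dim):
    """Packed size before any entropy coding (scales not counted)."""
    return sum(b * dim for b in bits_per_row) / 8


if __name__ == "__main__":
    # Size only: 983 head rows at INT8, the remaining 11,305 at INT5, d = 512.
    bits = [8] * 983 + [5] * (12288 - 983)
    print(f"raw payload: {raw_payload_bytes(bits, 512) / 1e6:.2f} MB")
```

The size function counts packed weight bits only; per-row scales and whatever brotli does afterwards are left out of this sketch.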

[Figure: Per-row bit allocation under Zipf. Bars: bits assigned per row, head vs tail. The overlaid line is log token frequency. Lowering the head cutoff or the tail bits lowers the compressed-bytes readout; the numbers below say what that saving costs in BPB. Default readout: vocab 12,288; head cutoff 8% (983 / 12,288 rows protected at head bits); head bits 8; tail bits 5 (below 5 is the cliff); Zipf s = 1.05 (natural text is roughly s ≈ 1.0–1.1); compressed embed 2.16 MB, −0.73 MB vs dense INT7; weighted err / uniform head 14.09.]

I built a damage atlas (per-row gradient norm × frequency) and ran the staged sweeps. Top-1024 INT8 + rest INT5. Top-2048 INT8 + rest INT4. Three-tier (INT8 / INT6 / INT4). The shape is right. The magnitude is small. The best frequency-weighted config beat uniform INT5 by about 0.005 BPB. Uniform INT5 was already +0.012 over INT7. Net cost over the locked INT7 baseline: about +0.007 BPB. Not enough.

The Pareto floor for embed-bit allocation at this scale is INT7. Below INT5 the wheels come off: INT4 cost +0.120 BPB at SP32k, INT3 cost +0.724. INT4 and below on the embed isn't a tradeoff, it's a cliff.

Approach 2: sparse-trained embedding

This was the path I had hopes for. Most rows in the embed are sparse-friendly in principle: only a handful of dimensions per token actually fire under any given context. So train under a per-row top-K mask, store only the kept values in FP16 plus a bitmap, brotli the whole thing.

The sparsity needed to fit the cap rises with vocab. Phase A measured it directly:

vocab	sparsity needed	brotli MB
SP10k	85%	1.46
SP12k	90%	1.14
SP16k	92%	1.28
SP24k	95%	1.34

This part worked. The bitmap-FP16-brotli format is genuinely tight. The serializer lives in train_gpt.py and it's about 30 lines.
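For a sense of what a bitmap-plus-FP16 layout looks like, here is a minimal sketch. It is not the serializer from train_gpt.py: the function names and the exact byte layout are assumptions, and `zlib` stands in for brotli only because it ships with Python.

```python
import struct
import zlib


def topk_mask(row, keep):
    """Boolean mask keeping the `keep` largest-magnitude entries of a row."""
    idx = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:keep]
    kept = set(idx)
    return [j in kept for j in range(len(row))]


def pack_sparse_rows(rows, keep):
    """Return (bitmap bytes, FP16 value bytes) for per-row top-K sparsity."""
    bitmap, values = bytearray(), bytearray()
    for row in rows:
        mask = topk_mask(row, keep)
        for start in range(0, len(mask), 8):  # 8 mask bits per byte, LSB first
            byte = 0
            for k, bit in enumerate(mask[start:start + 8]):
                if bit:
                    byte |= 1 << k
            bitmap.append(byte)
        for j, bit in enumerate(mask):
            if bit:
                values += struct.pack("<e", row[j])  # "<e" is little-endian FP16
    return bytes(bitmap), bytes(values)


def unpack_sparse_rows(bitmap, values, n_rows, dim):
    """Inverse of pack_sparse_rows; masked positions come back as 0.0."""
    bytes_per_row = (dim + 7) // 8
    rows, offset = [], 0
    for r in range(n_rows):
        row = [0.0] * dim
        for j in range(dim):
            if (bitmap[r * bytes_per_row + j // 8] >> (j % 8)) & 1:
                row[j] = struct.unpack_from("<e", values, offset)[0]
                offset += 2
        rows.append(row)
    return rows


def compress_payload(bitmap, values):
    """Stand-in for the brotli step (assumption: the real code uses brotli)."""
    return zlib.compress(bitmap + values, 9)
```

Storing a bitmap rather than explicit indices costs one bit per position regardless of sparsity, which is cheap at $d = 512$ and leaves the compressor a very regular stream to work on.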

The training is where it fell apart. The mask has to be re-applied after the EMA update, otherwise the EMA shadow carries nonzero values in the masked positions and the effective sparsity at save time is 60% even though the live weights are 90%. Forgetting to re-apply the mask post-EMA was the bug that ate two days. I want those two days back.
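A toy version of the failure is easy to reproduce. The sketch below assumes the EMA shadow was still dense when the mask switched on, which is one way the mismatch could arise; the helper names are hypothetical and the numbers are illustrative, not from the runs.

```python
def ema_update(shadow, live, decay):
    """Standard EMA: shadow <- decay * shadow + (1 - decay) * live."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, live)]


def apply_mask(weights, mask):
    """Zero every position the mask does not keep."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


def sparsity(weights):
    """Fraction of exactly-zero entries."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    mask = [True] + [False] * 9            # keep 1 of 10, i.e. 90% sparse
    live = apply_mask([0.5] * 10, mask)    # live weights respect the mask
    shadow = [1.0] * 10                    # assumption: shadow predates the mask

    leaky = ema_update(shadow, live, decay=0.99)
    fixed = apply_mask(leaky, mask)        # re-apply the mask after the EMA step

    print("live  sparsity:", sparsity(live))    # 0.9
    print("leaky sparsity:", sparsity(leaky))   # 0.0
    print("fixed sparsity:", sparsity(fixed))   # 0.9
```

The shadow decays toward zero at masked positions but never reaches it, so anything saved from the shadow has to be masked explicitly.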

Once the mask was honest, Phase C: SP12k+CO topk-0.10, four seeds (42, 43, 44, 1337), full training. Sidecar TTT BPB landed at 1.0959. Best 1.0931, mean ~1.097. The range across seeds was 0.0078: no lottery, no rescue from a lucky init. Decompose it and the story is plain. The bigger vocab buys −0.012, sparse-FP16 costs +0.045, net +0.033 over the SP8k INT7 baseline.

The mechanism, I think, is simple enough. Masked weights are zero. The model loses representation capacity for the tail tokens faster than the wider vocab amortises. You aren't getting the bigger vocab; you're getting a smaller effective vocab with a more expensive read path.

When it really fell over

I tried stacking sparse-FP16 with the full hparam stack from PR #1855 (BETA2=0.99, MLP_CLIP=11.5, EMBED_CLIP=14.0, WARMDOWN_FRAC=0.85, TTT_LORA_RANK=80, the works). Default-recipe sparse-FP16 was around 1.10 BPB. The stacked variant plateaued at 1.32092. The model just sat there at 1.32 ☹️.

I never isolated which interaction broke it. My best guess is TTT_LORA_RANK × sparse-embed gradient dynamics, or WARMDOWN_FRAC × sparse-EMA, but I didn't run the orthogonal sweep. Recipe stacks don't compose for free, especially when one of the recipe pieces is changing the geometry of a single tensor.

What's still open

The conceptually right answer is a straight-through-estimator-trained mixed-bit embed: forward in INTk, backward in FP16, learn the per-row bit allocation. I implemented it. It never converged under Muon; the gradient mismatch from the STE forward seems to interact badly with Muon's spectral update, and the embed loss curve went flat early and stayed there. I think this is the right thing to try next, but I don't think a single sentence of "the answer is STE" without working code counts as solved.
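To pin down what "forward in INTk, backward in full precision" means, here is the straight-through estimator on a single scalar weight. It is a deliberately tiny sketch under loud assumptions: fixed bit-width and scale, plain gradient descent rather than Muon, a squared-error loss, and no learned bit allocation, so it says nothing about why the real run stalled.

```python
def fake_quant(w, bits, scale):
    """Round w to the nearest representable INT-k level, then dequantise."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def ste_step(w, target, bits, scale, lr):
    """One descent step: quantised forward, straight-through backward."""
    w_q = fake_quant(w, bits, scale)       # forward pass sees the INT-k value
    loss = (w_q - target) ** 2
    grad_wq = 2.0 * (w_q - target)         # d loss / d w_q
    grad_w = grad_wq                       # STE: treat d w_q / d w as 1
    return w - lr * grad_w, loss           # the update lands on the latent weight


def train(w, target, bits=4, scale=0.1, lr=0.1, steps=50):
    """Run `steps` STE updates; returns the latent weight and the last loss."""
    loss = None
    for _ in range(steps):
        w, loss = ste_step(w, target, bits, scale, lr)
    return w, loss


if __name__ == "__main__":
    w, loss = train(0.0, target=0.5)
    print(f"latent w={w:.3f}, quantised={fake_quant(w, 4, 0.1):.1f}, loss={loss:.4f}")
```

The latent weight keeps moving even on steps where the quantised value does not change, which is the whole point of the trick and also where the gradient mismatch comes from.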

A factorised embed (low-rank residual on top of a quantised dense base) didn't help either. Possible bug: the residual was receiving gradients through the masked positions, which should have been zeroed. I shelved it before isolating.

The plot I keep coming back to is the vocab sweep. The gain is real, the curve hasn't saturated, and the question is whether the next compression idea costs less than the bigger vocab buys.

I don't know yet.
