Sparse embeddings: a six-week negative result
Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.
I kept thinking the embedding matrix was the easy part of parameter-golf. It's just a lookup table. You quantise it, you compress it, you ship. Then I spent six weeks on it and shipped nothing.
This is the post about that. The headline, honestly stated: doubling the vocab buys about 0.01 BPB on validation, and every way I found to compress the doubled embedding back under the 16 MB artifact cap cost more than 0.01 BPB. The locked variant is what it was before any of this work — SP8192 + CaseOps + INT7, 1.06755 BPB at 15.91 MB, 90 KB under the cap. Everything I tried at SP12k, SP16k, SP24k, SP32k landed somewhere between +0.005 and +0.045 BPB worse once compression was honest.
The naive argument is right
Bigger vocab is real. Same model, same recipe, same training budget, just more rows in the embed and a bigger LM head:
| vocab | best INT-bits | val BPB | brotli MB |
|---|---|---|---|
| SP8k | INT7 | 1.06755 | 15.91 |
| SP12k | INT7 | 1.06024 | 18.10 |
| SP16k | INT7 | 1.06024 | 18.10 |
| SP24k | INT7 | 1.05338 | 20.36 |
| SP32k | INT7 | 1.05076 | 22.52 |
Doubling vocab from 8k to 16k buys 0.0073 BPB. Doubling again to 32k buys another 0.0095. The curve hadn't saturated by SP32k. If I had a free 6 MB lying around, I'd just take the bigger vocab and stop writing.
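The deltas quoted above can be recomputed straight from the table. A minimal sketch: the dictionary transcribes the table rows as printed, the 16.0 MB cap is the artifact cap named earlier, and the function and variable names are mine rather than anything from the repo.

```python
# Rows transcribed from the vocab sweep table: vocab -> (val BPB, brotli MB).
SWEEP = {
    "SP8k": (1.06755, 15.91),
    "SP12k": (1.06024, 18.10),
    "SP16k": (1.06024, 18.10),
    "SP24k": (1.05338, 20.36),
    "SP32k": (1.05076, 22.52),
}
CAP_MB = 16.0  # artifact cap from the post


def bpb_gain(small: str, large: str) -> float:
    """BPB improvement (positive = better) from moving small -> large vocab."""
    return SWEEP[small][0] - SWEEP[large][0]


def over_cap_mb(vocab: str) -> float:
    """Compressed megabytes above the cap (negative means headroom)."""
    return SWEEP[vocab][1] - CAP_MB


if __name__ == "__main__":
    print(f"8k -> 16k gain: {bpb_gain('SP8k', 'SP16k'):.4f} BPB")
    print(f"16k -> 32k gain: {bpb_gain('SP16k', 'SP32k'):.4f} BPB")
    for vocab in SWEEP:
        print(f"{vocab}: {over_cap_mb(vocab):+.2f} MB vs cap")
```

Only SP8k comes out negative in the last loop, which is the same statement the cap line in the plot makes.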
[Figure: Validation BPB vs compressed embed bytes. SP8k locked baseline; SP12k–SP32k all blow the cap at INT7.]
The plot is the whole argument in one frame. Pick a vocab. Read off the dense INT7 dot. It's left of the cap line at SP8k and right of the cap line everywhere else.
The cost
For a vocab of V rows of width d at INT7, the embed is V × d × 7/8 bytes before compression, which brotli shrinks by about 0.55. The LM head is tied so it's free, but the rest of the model still has to fit inside the same 16 MB cap. Going by the table above, each extra 8k rows adds about 2.2 MB compressed at INT7, and SP32k ends up about 6.6 MB over the SP8k baseline. The cap is 16. The math doesn't close.
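The storage arithmetic is easy to sketch. Everything numeric here is an assumption for illustration: the width of 512 is a hypothetical placeholder, not the model's actual dimension, and I'm reading "about 0.55" as a compressed-to-raw ratio, which is only one way to read it.

```python
def packed_embed_bytes(vocab_size: int, dim: int, bits: int) -> int:
    """Bytes for a vocab_size x dim table at `bits` per weight, tightly packed."""
    total_bits = vocab_size * dim * bits
    return (total_bits + 7) // 8  # round up to whole bytes


def compressed_mb(raw_bytes: int, ratio: float) -> float:
    """Estimated size after compression, in MB (1 MB = 1e6 bytes here)."""
    return raw_bytes * ratio / 1e6


if __name__ == "__main__":
    DIM = 512      # hypothetical width, not taken from the post
    RATIO = 0.55   # assumed compressed/raw ratio
    for vocab in (8192, 16384, 32768):
        raw = packed_embed_bytes(vocab, DIM, bits=7)
        print(vocab, raw, round(compressed_mb(raw, RATIO), 2))
```

The raw size is linear in the row count, so whatever the true width and ratio are, each doubling of the vocab doubles the embed's share of the artifact.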
Everything that follows is some way of saying "what if we put the bigger vocab on a tighter diet?" None of them worked.
Approach 1: per-token bit allocation
The first instinct is right. Token frequencies are Zipf-like. The top thousand tokens carry maybe half the probability mass at our scale. Their rows are hit more often, their gradients are healthier, and they probably warrant higher precision than the long tail. So: top-K rows at INT8, the rest at INT4 or INT5.
[Figure: Per-row bit allocation under Zipf. Top-K% rows protected at headBits. Natural text is roughly s ≈ 1.0–1.1.]
I built a damage atlas (per-row gradient norm × frequency) and ran the staged sweeps. Top-1024 INT8 + rest INT5. Top-2048 INT8 + rest INT4. Three-tier (INT8 / INT6 / INT4). The shape is right. The magnitude is small. The best frequency-weighted config beat uniform INT5 by about 0.005 BPB. Uniform INT5 was already +0.012 over INT7. Net cost over the locked INT7 baseline: about +0.007 BPB. Not enough.
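The tiered schemes above are easy to cost out. This is a sketch under an idealised Zipf law, not the damage atlas: it ignores gradient norms entirely, and the helper names are mine. It only answers two questions, how much mass the protected head carries and what the average bit width comes to.

```python
from __future__ import annotations


def zipf_head_mass(vocab_size: int, top_k: int, s: float = 1.0) -> float:
    """Fraction of probability mass in the top_k most frequent tokens
    under an idealised Zipf law p(rank) ~ 1 / rank**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


def row_bits(vocab_size: int, tiers: list[tuple[int, int]], tail_bits: int) -> list[int]:
    """Per-row bit widths for rows sorted by descending frequency.
    tiers is [(row_count, bits), ...] applied from the head; the rest get tail_bits."""
    bits: list[int] = []
    for count, width in tiers:
        bits.extend([width] * count)
    bits = bits[:vocab_size]
    bits.extend([tail_bits] * (vocab_size - len(bits)))
    return bits


def mean_bits(bits: list[int]) -> float:
    """Average storage cost per weight, in bits, before compression."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    V = 32768
    print("head mass:", round(zipf_head_mass(V, 1024), 3))
    print("top-1024 INT8 + rest INT5:", mean_bits(row_bits(V, [(1024, 8)], 5)))
    print("top-2048 INT8 + rest INT4:", mean_bits(row_bits(V, [(2048, 8)], 4)))
```

Protecting the head is cheap in bytes because the head is a small share of the rows. That also means the average bit width barely moves off the tail's width, which fits the small gain over uniform INT5.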
The Pareto floor for embed-bit allocation at this scale is INT7. Below INT5 the wheels come off: INT4 cost +0.120 BPB at SP32k, INT3 cost +0.724. Going under INT5 on the embed isn't a tradeoff, it's a cliff.
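For intuition on why the low-bit end degrades so fast, here is a toy measurement of weight-space error against bit width. It assumes a symmetric per-row absmax quantiser, which the post doesn't specify, and it runs on a random Gaussian row rather than a trained embedding. It says nothing about BPB; those numbers are measurements, not something this reproduces.

```python
import random


def fake_quantise(row: list, bits: int) -> list:
    """Symmetric absmax quantisation of one row to signed `bits`-bit integers,
    returned dequantised so the error can be measured directly."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in row) / qmax
    if scale == 0.0:
        return list(row)  # all-zero row: nothing to quantise
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]


def rms_error(row: list, bits: int) -> float:
    """Root-mean-square difference between a row and its quantised copy."""
    deq = fake_quantise(row, bits)
    return (sum((a - b) ** 2 for a, b in zip(row, deq)) / len(row)) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.gauss(0.0, 1.0) for _ in range(512)]
    for bits in (8, 7, 6, 5, 4, 3):
        print(f"INT{bits}: rms error {rms_error(row, bits):.4f}")
```

Each bit removed roughly doubles the step size, so the error grows geometrically as the width drops.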
Approach 2: sparse-trained embedding
This was the path I had hopes for. Most rows in the embed are sparse-friendly in principle: only a handful of dimensions per token actually fire under any given context. So train under a per-row top-K mask, store only the kept values in FP16 plus a bitmap, brotli the whole thing.
The sparsity needed to fit the cap rises with vocab. Phase A measured it directly:
| vocab | sparsity needed | brotli MB |
|---|---|---|
| SP10k | 85% | 1.46 |
| SP12k | 90% | 1.14 |
| SP16k | 92% | 1.28 |
| SP24k | 95% | 1.34 |
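A back-of-envelope check on why such high sparsity is needed: before brotli, a bitmap-plus-FP16 layout pays one bit per position plus 16 bits per surviving value. The sketch below covers only that pre-compression cost (function names are mine); the MB column above is measured after brotli, which this does not model.

```python
def sparse_bits_per_weight(sparsity: float, value_bits: int = 16) -> float:
    """Pre-compression cost per embedding weight for bitmap + kept values:
    1 bitmap bit for every position, value_bits for each surviving one."""
    return 1.0 + value_bits * (1.0 - sparsity)


def break_even_sparsity(dense_bits: int, value_bits: int = 16) -> float:
    """Sparsity at which bitmap + values costs the same as dense `dense_bits`."""
    return 1.0 - (dense_bits - 1.0) / value_bits


if __name__ == "__main__":
    for s in (0.85, 0.90, 0.92, 0.95):
        print(f"sparsity {s:.2f}: {sparse_bits_per_weight(s):.2f} bits/weight")
    print("break-even vs dense INT7:", break_even_sparsity(7))
```

Under these assumptions 90% sparsity works out to 2.6 bits per weight, and the layout only matches dense INT7 once sparsity passes 62.5%. FP16 values are expensive enough that the mask has to do nearly all the work.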
This part worked. The bitmap-FP16-brotli format is genuinely tight. The serializer lives in train_gpt.py and it's about 30 lines.
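To make the format concrete, here is a minimal sketch of a bitmap-plus-FP16 row codec. It is not the serializer from train_gpt.py: the function names and byte layout are my own illustration, and zlib stands in for brotli so the example stays inside the standard library.

```python
import struct
import zlib


def pack_sparse_row(row: list) -> bytes:
    """Bitmap (1 bit per position, LSB first) followed by the non-zero values as FP16."""
    bitmap = bytearray((len(row) + 7) // 8)
    kept = []
    for i, value in enumerate(row):
        if value != 0.0:
            bitmap[i // 8] |= 1 << (i % 8)
            kept.append(value)
    # "e" is IEEE 754 half precision; values beyond the FP16 range raise OverflowError.
    return bytes(bitmap) + struct.pack(f"<{len(kept)}e", *kept)


def unpack_sparse_row(blob: bytes, dim: int) -> list:
    """Inverse of pack_sparse_row for a row of known width `dim`."""
    n_bitmap = (dim + 7) // 8
    bitmap, payload = blob[:n_bitmap], blob[n_bitmap:]
    kept = iter(struct.unpack(f"<{len(payload) // 2}e", payload))
    return [next(kept) if (bitmap[i // 8] >> (i % 8)) & 1 else 0.0 for i in range(dim)]


def compress_stand_in(blob: bytes) -> bytes:
    """zlib as a stdlib stand-in for brotli (the post uses brotli)."""
    return zlib.compress(blob, 9)


if __name__ == "__main__":
    row = [0.0, 1.5, 0.0, -0.25, 0.0, 0.0, 2.0, 0.0, 0.0, 0.5]
    blob = pack_sparse_row(row)
    assert unpack_sparse_row(blob, len(row)) == row
    print(len(blob), "bytes packed,", len(compress_stand_in(blob)), "bytes compressed")
```

The example values are all exactly representable in FP16, so the round trip is lossless here. Real weights would pick up FP16 rounding on the kept entries.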
The training is where it fell apart. The mask has to be re-applied after the EMA update. Otherwise the EMA shadow keeps decaying, non-zero values in the masked positions, and the effective sparsity at save time is 60% even though the live weights are 90%. Not re-applying the mask post-EMA was the bug that ate two days. I want those two days back.
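The failure mode is easy to reproduce in a toy. This sketch has no gradients and no real training loop, just a masked live vector and an EMA shadow initialised from the dense weights. The vector, the mask, and the decay value are all made up for illustration, so it won't reproduce the 60% figure, only the mechanism.

```python
def ema_update(shadow: list, live: list, decay: float) -> None:
    """In-place exponential moving average: shadow <- decay*shadow + (1-decay)*live."""
    for i, w in enumerate(live):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * w


def apply_mask(weights: list, mask: list) -> None:
    """Zero every position whose mask entry is False."""
    for i, keep in enumerate(mask):
        if not keep:
            weights[i] = 0.0


def sparsity(weights: list) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


def run(remask_shadow: bool, steps: int = 100, decay: float = 0.99):
    live = [0.5, -0.3, 0.8, 0.1, -0.9, 0.2, 0.4, -0.6, 0.7, -0.05]
    mask = [i == 4 for i in range(10)]  # hypothetical top-1 mask: keep only -0.9
    shadow = list(live)                 # shadow starts from the dense init
    for _ in range(steps):
        apply_mask(live, mask)          # live weights are always masked
        ema_update(shadow, live, decay)
        if remask_shadow:
            apply_mask(shadow, mask)    # the step that is easy to forget
    return sparsity(live), sparsity(shadow)


if __name__ == "__main__":
    print("no re-mask:", run(remask_shadow=False))
    print("re-mask:   ", run(remask_shadow=True))
```

Without the re-mask, the shadow's masked entries shrink by a factor of `decay` every step but never hit exactly zero, so a checkpoint saved from the shadow is dense even though the live weights are 90% sparse.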
Once the mask was honest, Phase C: SP12k+CO topk-0.10, four seeds (42, 43, 44, 1337), full training. Sidecar TTT BPB landed at 1.0959. Best 1.0931, mean ~1.097. The range across seeds was 0.0078 — no lottery, no rescue from a lucky init. Decompose it: the bigger vocab buys −0.012, sparse-FP16 costs +0.045, net +0.033 over the SP8k INT7 baseline.
The mechanism, I think, is simple enough. Masked weights are zero. The model loses representation capacity for the tail tokens faster than the wider vocab amortises. You aren't getting the bigger vocab; you're getting a smaller effective vocab with a more expensive read path.
When it really fell over
I tried stacking sparse-FP16 with the full hparam stack from PR #1855 (BETA2=0.99, MLP_CLIP=11.5, EMBED_CLIP=14.0, WARMDOWN_FRAC=0.85, TTT_LORA_RANK=80, the works). Default-recipe sparse-FP16 was around 1.10 BPB. The stacked variant plateaued at 1.32092. The model just sat there at 1.32 ☹️.
I never isolated which interaction broke it. My best guess is TTT_LORA_RANK × sparse-embed gradient dynamics, or WARMDOWN_FRAC × sparse-EMA, but I didn't run the orthogonal sweep. Recipe stacks don't compose for free, especially when one of the recipe pieces is changing the geometry of a single tensor.
What's still open
The conceptually right answer is straight-through-estimator-trained mixed-bit embed: forward in INTk, backward in FP16, learn the per-row bit allocation. I implemented it. It never converged under Muon — the gradient mismatch from the STE forward seems to interact badly with Muon's spectral update, and the embed loss curve looked like a logarithmic shrug. I think this is the right thing to try next, but I don't think a single sentence of "the answer is STE" without working code counts as solved.
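For readers who haven't seen a straight-through estimator, here is the core trick on a single scalar weight with plain SGD. It is a toy under my own assumptions: no Muon, no embedding, no learned bit allocation, and the data and step size are invented. It only shows the forward-quantised, backward-identity pattern and the boundary oscillation it tends to produce.

```python
from __future__ import annotations


def quantise(w: float, step: float) -> float:
    """Round w to the nearest multiple of step (the 'INTk' forward value)."""
    return round(w / step) * step


def train_ste(target: float, step: float, lr: float = 0.05, iters: int = 200) -> tuple[float, float]:
    """Fit y = target * x with a quantised weight, using a straight-through gradient."""
    w = 0.0                      # latent full-precision weight
    xs = [0.5, 1.0, 1.5, 2.0]
    for _ in range(iters):
        q = quantise(w, step)    # forward pass sees only the quantised value
        # d(loss)/dq for mean squared error over xs
        grad_q = sum(2.0 * (q * x - target * x) * x for x in xs) / len(xs)
        # straight-through: pretend dq/dw = 1 and apply grad_q to the latent weight
        w -= lr * grad_q
    return w, quantise(w, step)


if __name__ == "__main__":
    w, q = train_ste(target=0.73, step=0.25)
    print(f"latent w = {w:.4f}, quantised = {q}")
```

The latent weight ends up hovering at the rounding boundary between the two grid points that bracket the target, so the forward value flips between them. The gradient is computed at the quantised value but applied to the latent one, and that mismatch is the kind of thing that could interact badly with an optimizer that reshapes the update.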
A factorised embed (low-rank residual on top of a quantised dense base) didn't help either. Possible bug: the residual was receiving gradients through the masked positions, which should have been zeroed. I shelved it before isolating.
The plot I keep redrawing is the vocab sweep. The gain is real, the curve hasn't saturated, and the question is just whether the next idea pays the storage tax.
I don't know yet.