Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.
WRITING
SearchDoubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.
I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.
Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.
A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.
Open the code for Llama, Qwen, DeepSeek, Gemma, and you'll find nearly identical design decisions. What has changed since the 2017 paper, in normalisation, gating, tokenizer, embeddings, position, and why each swap stuck.
A short note on who's writing this blog and why.