WRITING

008 LATEST

Teaching a language model to think in Finnish

Your AI answers you in Finnish, but it thinks in English. I tried to change that, and along the way it turned out you cannot even reward a model into thinking in a language it never tries. Here is the diagnosis, the fix, and the honest failures.

contains: interactive plot, quiz · 18 min · en / fi

2026

007 Can a language model remember you without rereading you?

May 1, 2026

Every request arrives knowing nothing about who's asking, so we stuff facts into the prompt and reread them forever. I spent months of my mech-interp PhD trying to precompute a per-user memory instead: two clean failures, one bank architecture that works, and a layer-injection picture I'd defend hardest.

contains: interactive plot · 10 min

006 Sparse embeddings: a six-week negative result

April 24, 2026

I thought the embedding matrix was the easy part of parameter-golf: quantise it, compress it, ship it. Six weeks later I'd tried per-token bits, sparse training, and STE, and every one of them lost to plain dense INT7. Doubling the vocab buys ~0.01 BPB; buying the bytes back costs more.

contains: interactive plot · 5 min

005 Quantization is a hardware story (and an entropy puzzle)

April 17, 2026

I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.

contains: interactive plot · 7 min

004 Why do I keep reaching for state-space models on the edge?

April 10, 2026

Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around mid-2025 SSM hybrids started showing up in frontier models.

contains: interactive plot · 11 min

003 Attention variants beyond softmax: a 2026 map

April 3, 2026

Open models quietly stopped putting softmax attention on every layer, and I couldn't find a single chart showing what replaced it. So I drew the map myself: MLA, linear, sparse, and the hybrid stacks three labs landed on independently, all on the same axes.

contains: interactive plot, quiz · 7 min

002 Why does every lab ship the same transformer block?

March 27, 2026

Open the code for Llama, Qwen, DeepSeek, or Gemma and you find nearly the same forward pass. This is my catalogue of everything that changed outside attention since the 2017 Transformer paper (norms, gates, tokenizer, embeddings, position) and why each swap stuck.

contains: interactive plot, quiz · 11 min

001 Hi!

March 20, 2026

Who I am, why I'm writing here instead of only in papers, and why the plots on this site have sliders.

4 min
