Why does every lab ship the same transformer block?

I keep opening modelling code from a different lab and finding the same block. Llama, Qwen, DeepSeek, Gemma, Phi. Different teams, different stacks, different post-training recipes, but the forward pass is mostly the same. Read Attention Is All You Need (2017) now and you'll find that almost everything outside the attention call has since been replaced.

That's the surprising bit. Fields with this much money in them usually diverge. This post is the catalogue of the non-attention swaps and why each one stuck. Attention has a freaking lot of recent variants of its own (linear, sliding-window, MLA, GQA, MoE-routed); they get a separate post. State space models are weirder still and get the post after that.

Decoder-only won

Encoder-decoder (T5 (2019), BART (2019)) and encoder-only (BERT (2018)) had the lead in 2018-2020. Then GPT-3 (2020) showed that next-token prediction at scale gives you everything the other two were doing (translation, classification, span infilling) plus open-ended generation. By the time Llama-1 shipped (2023), encoder-only had narrowed to retrieval and embeddings, and encoder-decoder lived mostly in T5 lineages.

Causal LM → seq2seq → MLM → causal LM (with prompts). The forward pass got simpler. Everything else in this post is what filled the space the simplification opened up.

Normalisation: drop the mean, swap the order, normalise the queries

LayerNorm subtracts the mean, divides by std, applies a learned affine. Roughly $6d$ FLOPs per token. RMSNorm (2019) noticed the mean-subtraction step doesn't change the loss curve. Strip it:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \epsilon}} \odot g.$$

Two passes, $\approx 4d$ FLOPs. The argument: the next linear layer can re-centre its inputs anyway, so the mean term was always redundant. Quality holds in every replication I've seen. The FLOPs come back for free.
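To make the difference concrete, here is a minimal pure-Python sketch of both norms on a single token vector. The function names, the `eps` value, and the list-of-floats representation are my assumptions for illustration; real implementations are fused tensor kernels.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the std, apply a learned affine."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, apply a learned gain.
    No mean subtraction and no bias."""
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [1.0, 2.0, 3.0, 6.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, ones, zeros)   # centred: entries sum to ~0
rms_out = rms_norm(x, ones)           # not centred: mean square is ~1
```

With unit gain, the LayerNorm output is centred while the RMSNorm output only has unit mean square; that missing centring step is the whole difference.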

Figure (interactive): LayerNorm vs RMSNorm, per-call cost. Two straight lines and a gap that opens linearly with sequence length; you pay it twice per layer in pre-norm. Setting the kernel-launch overhead to 0 shows how much of each call was overhead rather than math. Defaults: hidden dim d = 4096, max sequence length 8192, device 200 TFLOP/s, kernel launch 3.00 µs, giving 24,576 LayerNorm FLOPs/token vs 16,384 RMSNorm FLOPs/token and an 8.4% saving at max sequence length.

The other normalisation argument is about order. The original transformer puts the norm after each residual (post-norm: $x_{l+1} = \text{norm}(x_l + f(x_l))$). Xiong et al. (2020) showed that this contracts gradients geometrically with depth, which is why the original paper needed warmup tricks to train at all. Pre-norm ($x_{l+1} = x_l + f(\text{norm}(x_l))$) keeps a clean residual path through depth and trains stably without warmup. Every modern decoder I'm aware of is pre-norm.

Figure (interactive): gradient norm through depth, pre-norm vs post-norm. A toy linearisation, not a training curve: post-norm contracts the gradient by roughly $(1+\alpha^2)^{-1/2}$ per block, so it falls off geometrically with depth while pre-norm's residual path stays open. A bigger residual scale α means stronger contraction in post-norm. Defaults: depth L = 48 (Llama-3 8B is 32; 70B is 80), α = 1.00, giving a per-block post-norm contraction of 0.707, pre-norm ‖∇‖ at L of 0.837, and post-norm ‖∇‖ at L of 5.96e-8.

The qualitative shape is what the literature shows: post-norm at $L = 48$ has already lost most of its gradient signal. Crank depth to 128 and the post-norm line touches zero.
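The post-norm number is easy to reproduce from the toy model's per-block factor $(1+\alpha^2)^{-1/2}$. The sketch below covers only that post-norm side of the toy linearisation; the function name is mine, and it is not a measurement of any real training run.

```python
import math

def post_norm_gradient_scale(depth, alpha=1.0):
    """Toy model: each post-norm block multiplies the gradient norm by
    (1 + alpha^2)^(-1/2), so the product decays geometrically with depth."""
    per_block = 1.0 / math.sqrt(1.0 + alpha ** 2)
    return per_block ** depth

for depth in (12, 32, 48, 80, 128):
    print(depth, post_norm_gradient_scale(depth))
```

At $\alpha = 1$ the per-block factor is $1/\sqrt{2} \approx 0.707$, so 48 blocks give $2^{-24} \approx 5.96 \times 10^{-8}$.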

The third normalisation move is recent. QK-Norm applies an extra RMSNorm to $Q$ and $K$ separately before the attention dot product. The story is empirical: at large scale, $Q \cdot K$ logits drift to magnitudes where softmax saturates, training stalls, and loss curves blow up unpredictably. Normalising each side first keeps the logit distribution well-conditioned. It started showing up in 2020 vision transformer papers, was rediscovered for LLM stability in 2024, and is now standard in Llama-4, Gemma 3, Qwen3, and most frontier open releases. It costs almost nothing and removes a class of training instabilities. I think that's a good trade. Stanford's CS336 lecture walks through QK-Norm at 1:10:25 if you want the video version.
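A minimal sketch of why the normalisation bounds the logit, for one query and one key vector. The helper names and the unit gain are assumptions for illustration; real implementations carry a learned gain per head and differ in where the scaling goes.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root-mean-square, then apply a gain (unit gain by default)."""
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    gain = gain or [1.0] * d
    return [v * inv_rms * g for v, g in zip(x, gain)]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one (query, key) pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately large activations, standing in for logit drift at scale.
q = [40.0, -35.0, 60.0, 10.0]
k = [50.0, -20.0, 45.0, 5.0]
raw = attention_logit(q, k, qk_norm=False)   # grows with the activation scale
normed = attention_logit(q, k, qk_norm=True) # bounded by sqrt(d) at unit gain
```

With unit gain each normalised vector has length at most $\sqrt{d}$, so the scaled logit cannot exceed $\sqrt{d}$ in magnitude however large the raw activations get.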

SwiGLU and the GLU family

The FFN block in GPT-2 is two linears with a pointwise nonlinearity:

$$\text{FFN}(x) = W_{\text{out}} \, \text{GELU}(W_{\text{in}} x).$$

GLU Variants (2020) replaced the pointwise step with a gated unit:

$$\text{SwiGLU}(x) = W_{\text{out}} \, \bigl( \text{swish}(W_1 x) \odot (W_2 x) \bigr),$$

with $\text{swish}(z) = z \cdot \sigma(z)$. The same paper introduced GeGLU (gelu-gated) and ReGLU (relu-gated). PaLM (2022) was the first big training run to ship SwiGLU, and Llama-1 brought it to the first widely used open model. A gated FFN has three weight matrices instead of two, so at the same inner width it costs roughly $1.5\times$ the parameters and compute of a plain FFN per token. To match the non-gated baseline you shrink the inner width to $2/3$ of its usual value (from $4d$ to $8d/3$), which brings both back to rough parity.
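Here is a small sketch of the gated FFN together with the parameter arithmetic behind the $2/3$ rule. Matrices are plain lists of rows, biases are omitted, and the helper names plus the $4d$ baseline width are assumptions for illustration.

```python
import math

def swish(z):
    """swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W_out):
    """W_out (swish(W1 x) * (W2 x)), with an elementwise product in the middle."""
    gate = [swish(z) for z in matvec(W1, x)]
    value = matvec(W2, x)
    return matvec(W_out, [g * v for g, v in zip(gate, value)])

def ffn_params(d, inner, gated):
    """Weight count: two d x inner matrices if plain, three if gated (no biases)."""
    return (3 if gated else 2) * d * inner

d = 4096
plain = ffn_params(d, 4 * d, gated=False)                # 2 * d * 4d = 8 d^2
gated_full = ffn_params(d, 4 * d, gated=True)            # 1.5x the plain FFN
gated_shrunk = ffn_params(d, (8 * d) // 3, gated=True)   # ~8 d^2 again
```

Since matmul FLOPs per token scale with the weight count, the same ratios apply to compute: $1.5\times$ at equal inner width, roughly $1\times$ after the shrink.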

Figure (interactive): SwiGLU, GeGLU, ReGLU on a 1D slice. Three activations on one gate template, act(w₁·x) · (w₂·x), using swish (z·σ(z)) for SwiGLU, gelu (tanh approximation) for GeGLU, and relu for ReGLU. SwiGLU and GeGLU sit on top of each other (swish and gelu are near-twins); ReGLU has a sharper kink where the gate crosses zero. Defaults: gate weight w₁ = 1.20, value weight w₂ = 0.80, x range 4.

The real loss-per-parameter ranking from the paper has SwiGLU ≈ GeGLU > ReGLU, but the gap between any of these and a plain ReLU FFN is much larger than the gap between them. The gate carves shapes a pointwise nonlinearity can't. That's the actual win, and the activation choice on the gate branch is mostly a footnote.

Tokenizer: BPE → SentencePiece → byte-BPE, vocab climbing

The tokenizer story is shorter. BPE (2015) became the default early. SentencePiece (2018) added Unigram and language-agnostic byte handling. OpenAI's tiktoken encodings (cl100k_base, o200k_base) and Llama-3's 128k tokenizer are byte-BPE variants tuned for code and multilingual coverage. Vocab grew from 32k (Llama-1, Llama-2) to 128k (Llama-3) to ~200k (GPT-4o).

The interesting open question isn't which tokenizer to pick. It's whether tokenizing should exist at all. Byte-level models (no tokenizer, learn directly on UTF-8 bytes) have looked promising on smaller scales but haven't displaced BPE at frontier scale. The bet is open: tokenizers buy you context-length efficiency (fewer tokens to represent the same text) at the cost of an extra preprocessing step that doesn't generalise across languages cleanly. Raw bytes feel architecturally cleaner. Whether they win depends on whether someone makes byte-level pretraining cost-competitive at frontier scale. I'm watching that one.
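To see what "fewer tokens for the same text" means mechanically, here is a toy byte-level BPE trainer. It is a sketch under simplifying assumptions: no pre-tokenisation, no special tokens, one short training string, and a function name I made up. Production tokenizers add all of those and train on far more data.

```python
from collections import Counter

def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:          # nothing repeats any more, stop early
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            # Replace each occurrence of the chosen pair with one merged token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

text = "the cat sat on the mat; the hat sat on the cat"
raw_bytes = len(text.encode("utf-8"))
merges, tokens = train_byte_bpe(text, num_merges=20)
bytes_per_token = raw_bytes / len(tokens)
```

A pure byte-level model is the `num_merges=0` case: exactly one token per byte. Every merge trades one more vocabulary entry for a shorter sequence.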

Figure (interactive): bytes per token vs vocab size, plotted with the embedding-table share (×10 so it fits on the same axis). A heuristic (Zipf-distributed unigram, α≈1), not measured data. Bytes per token climbs like log V and flattens fast; the embedding-share line keeps climbing. A smaller Zipf α means a heavier tail, and compression saturates even earlier. Defaults: V = 128k tokens (32k = Llama-1/2; 128k = Llama-3; 200k = GPT-4o), hidden dim d = 4096, total params P = 8.0B, Zipf α = 1.00, giving 4.16 bytes/token (heuristic) and a 6.55% embedding-table share.

The compression curve is sublinear in $V$. Doubling the vocab from 32k to 64k buys a meaningful drop in tokens per byte; doubling again to 128k buys less; the next doubling to 256k buys almost nothing. There's a knee, and frontier tokenizers sit near it. The embedding-share curve is the other half of the trade. We'll come back to that in a later blog post once my experiment results land.

Embedding tying: why "in" and "out" used to be the same job

Press & Wolf (2017) noticed something cheap. The input embedding maps token id → vector. The output projection (sometimes called the unembedding or LM head) maps vector → distribution over token ids. Both are $V \times d$ matrices. Why not share weights? They tested it and reported lower perplexity with fewer parameters.
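In code, tying just means the lookup and the output projection read the same table. A toy sketch with made-up sizes and random weights (the names `embed` and `lm_head` are mine):

```python
import random

V, d = 8, 4                      # toy vocab size and hidden dim
random.seed(0)
E = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(V)]  # shared table
U = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(V)]  # separate head

def embed(token_id, table):
    """Input side: token id -> vector (a row lookup)."""
    return table[token_id]

def lm_head(hidden, table):
    """Output side: vector -> one logit per token (dot product with every row)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in table]

hidden = embed(3, E)             # stand-in for the final hidden state
tied_logits = lm_head(hidden, E)     # tied: the same V x d matrix on both ends
untied_logits = lm_head(hidden, U)   # untied: a second V x d matrix

tied_params, untied_params = V * d, 2 * V * d
```

Tied, the logit for a token is its embedding row dotted with the hidden state, so the one matrix has to serve both roles at once.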

The trick survived for years. GPT-2, GPT-J, Llama-1, Llama-2, Gemma 1/2 small variants all tied. Then frontier scale broke the assumption. Llama-3 (2024) untied. DeepSeek-V3 untied. Qwen3's small models tie, large models untie. One reading: untying buys a real quality bump for the added parameter count, and past some model size that trade is worth it. The more interesting reading is that "in" and "out" aren't quite the same job once you have enough capacity to distinguish them. The input role wants similar tokens to land near each other in vector space; the output role wants the softmax to separate them.

Figure (interactive): embedding parameters as a share of total model, tied (V·d) vs untied (2·V·d), swept over model size at V=128k, d=4096. At 1B the embedding table is a third of the model; at 70B it's a few percent. As P grows, the economic argument for tying evaporates. Defaults: P range 0.5B – 200B; tied share at max P 0.26%, untied share at max P 0.52%.

At V=128k, d=4096, with 1B parameters in the rest of the model, you're spending ~50% of the model on the embedding table if you untie. Tying takes that to about a third. At 70B the untied share is ~1.5% and tying buys you 0.7%. Below ~7B I'd still tie; above 30B the savings stop mattering and the expressivity argument starts to.
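The share arithmetic is short enough to check by hand. This sketch assumes one particular convention, which is mine: the size figure counts the non-embedding parameters, and the share is the table divided by table plus the rest.

```python
def embedding_share(vocab, d, non_embedding_params, tied):
    """Fraction of all parameters that sit in the embedding table(s).
    Assumed convention: total = non-embedding params + table params."""
    table = vocab * d * (1 if tied else 2)
    return table / (non_embedding_params + table)

vocab, d = 128_000, 4096
for rest in (1e9, 7e9, 70e9):
    tied = embedding_share(vocab, d, rest, tied=True)
    untied = embedding_share(vocab, d, rest, tied=False)
    print(f"{rest / 1e9:>4.0f}B rest: tied {tied:.1%}, untied {untied:.1%}")
```

Under that convention the table dominates a 1B model and is a rounding error at 70B, which is the economic half of the argument.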

Position: from a parameter table to a function

GPT-2 stored absolute positions as a learned $L \times d$ table. Two problems. The model couldn't generalise to longer contexts than it saw. And the input was absolute, so any "this token is two positions after that one" structure had to be rediscovered by attention from the difference of two arbitrary vectors.

RoFormer (2021) made position a function. Treat each pair of dimensions in $Q$ and $K$ as a 2D subspace and rotate it by $m \cdot \theta_i$ at position $m$. The dot product between a query at position $m$ and a key at position $n$ then depends only on the relative offset $m - n$. The structural consequence (the one I want for this post) is that the positional table moves out of the parameter budget entirely. RoPE is parameter-free. Its base $b$ is a hyperparameter, and you can dial it after training to push context out without retraining.
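The relative-offset property is easy to verify numerically. A minimal sketch for one vector, assuming the common frequency schedule $\theta_i = b^{-2i/d}$ with $b = 10000$ and adjacent dimensions paired up; real implementations vary in how they pair dimensions.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by position * theta,
    with theta = base^(-i/d) for even i (assumed schedule)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]   # 2D rotation of the pair
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
near = dot(rope(q, 3), rope(k, 7))       # query at 3, key at 7
far = dot(rope(q, 103), rope(k, 107))    # same offset, shifted by 100
```

`near` and `far` agree up to floating-point error: shifting both positions by the same amount leaves the logit unchanged, and no learned table is involved.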

GPT-2 → RoPE: position went from "another lookup table the optimiser has to fill" to "a deterministic geometric transform applied at every layer's $Q$, $K$". The deep-dive (NTK-aware scaling, YaRN, NoPE, ALiBi) might get its own post if I find the time. For now the structural framing is the point: in the modern stack you're not training a position embedding at all.

A modern block in one list

Here's the list. Every modern open decoder layer looks roughly like this:

Residual + RMSNorm (pre-norm)
Attention (deferred to the next post; usually with QK-Norm and RoPE applied to $Q$, $K$)
Residual
Residual + RMSNorm
SwiGLU FFN (two parallel linears into the inner width, swish gate, output projection)
Residual

Outside the block: byte-BPE tokenizer at ~128k vocab, untied embeddings if the model is large, no positional table.
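As a sketch, the list above is two pre-norm residual updates. The attention and FFN sublayers are passed in as hypothetical callables (`attn_fn`, `ffn_fn`) because their internals are out of scope here, and `x` stands for a single token's hidden vector even though real attention mixes across the sequence.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root-mean-square, apply a learned gain."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

def decoder_block(x, attn_fn, ffn_fn, attn_gain, ffn_gain):
    """Pre-norm block: each sublayer sees a normalised copy of x,
    and its output is added back onto the untouched residual stream."""
    x = [a + b for a, b in zip(x, attn_fn(rms_norm(x, attn_gain)))]
    x = [a + b for a, b in zip(x, ffn_fn(rms_norm(x, ffn_gain)))]
    return x

# Placeholder sublayers, for illustration only.
zero = lambda h: [0.0] * len(h)
identity = lambda h: list(h)

x = [1.0, -2.0, 3.0, 0.5]
ones = [1.0] * len(x)
passthrough = decoder_block(x, zero, zero, ones, ones)
updated = decoder_block(x, identity, identity, ones, ones)
```

With both sublayers zeroed the block returns its input unchanged, which is the "clean residual path" that pre-norm buys.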

Figure (interactive): modern block vs vanilla GPT-2 block, quality per FLOP. A toy composition, not a measurement. The x-axis morphs a vanilla GPT-2 block into the full modern stack; no single lift moves the line much on its own, and the separation at 1 is all of them stacked. Defaults (Δquality / ΔFLOPs): SwiGLU +4.0% / +3.0%, RMSNorm +0.5% / -1.0%, QK-Norm +1.5% / +0.4%; advantage scale 1.50; composite Q/F edge +5.2%.

Recent launches as data points

Releases from 2019 to 2025, listed by year with the architectural choices each one ships. The table is wider than a phone screen by design. Scroll horizontally if you're on mobile.

Model	Year	Norm	FFN	Tokenizer	Position	Embedding
GPT-2	2019	LayerNorm (post→pre)	GELU	50k BPE	learned absolute	tied
GPT-3	2020	LayerNorm (pre)	GELU	50k BPE	learned absolute	tied
PaLM	2022	RMSNorm	SwiGLU	256k SentencePiece	RoPE	untied
Llama-1	2023	RMSNorm	SwiGLU	32k SentencePiece	RoPE	tied
Llama-2	2023	RMSNorm	SwiGLU	32k SentencePiece	RoPE	tied
Mistral 7B	2023	RMSNorm	SwiGLU	32k SentencePiece	RoPE	tied
Llama-3	2024	RMSNorm	SwiGLU	128k byte-BPE	RoPE (high base)	untied
Phi-4	2024	RMSNorm	SwiGLU	tiktoken o200k	RoPE	untied
DeepSeek-V3	2024	RMSNorm + QK-Norm	SwiGLU (MoE)	128k byte-BPE	RoPE (decoupled)	untied
Gemma 3	2025	RMSNorm + QK-Norm	GeGLU	256k SentencePiece	RoPE	tied (small) / untied (large)
Llama-4	2025	RMSNorm + QK-Norm	SwiGLU (MoE)	128k byte-BPE	RoPE (high base)	untied
Qwen3	2025	RMSNorm + QK-Norm	SwiGLU	152k byte-BPE	RoPE	tied (≤4B) / untied (≥7B)

The non-attention columns line up fast. Norm: LayerNorm → RMSNorm → RMSNorm + QK-Norm. FFN: GELU → SwiGLU (with GeGLU as a near-equivalent variant). Tokenizer: 32k SentencePiece → 128–256k byte-BPE. Position: learned absolute → RoPE (with rising bases for long context). Embedding: tied → untied past ~7B.

The attention column is the one still in flux. Standard MHA → MQA → GQA → MLA → sliding-window → mixture-of-experts routing → linear attention. That's the topic of the next post.

What's left for the next two posts

Attention variants: GQA (2023), MLA in DeepSeek-V2 (2024), sliding-window in Mistral (2023) and Gemma 3, MoE routing as the new dimension every frontier release picks differently. The same-block story falls apart here. The next post is about why it does.

State space models: Mamba (2023) and the SSM-attention hybrids that are showing up in 2025 releases. I think they're the most credible attention-replacement out there, and they're also the most ergonomically different. Worth their own post.

Check yourself

Quiz

A 1B-parameter model with V=128k vocab and d=2048 hidden dim is using tied embeddings. You're considering untying. The non-embedding parameters total ~700M. Which of the following best describes the trade?

It's not just frontier scale, either. Keller Jordan's modded-nanogpt is a hand-tuned GPT-2 speedrun where the goal is just to hit a target validation loss as fast as possible on a small budget. It ships roughly the same block, top to bottom. A 124M-parameter speedrun repo and a trillion-parameter MoE from DeepSeek, Meta, or Kimi ended up in the same place.

I find that genuinely remarkable. Multiple labs, domains, sizes, continents, and they all land on one block. It probably means we're in a local minimum that's hard to leave (which is its own kind of warning). It also means that if you're starting a project from scratch, you should pick this block as your default and deviate only where you have a measurement that justifies it.
