
The modern transformer block (everything except attention)

Open the code for Llama, Qwen, DeepSeek, or Gemma and you'll find nearly identical design decisions. Here is what has changed since the 2017 paper in normalisation, gating, tokenizers, embeddings, and position, and why each swap stuck.

Vilhelm Toivonen


I keep opening modelling code from a different lab and finding the same block. Llama, Qwen, DeepSeek, Gemma, Phi. Different teams, different stacks, different post-training recipes, but the forward pass is mostly the same. Reread Attention Is All You Need (2017) now and you'll find that almost everything outside the attention call has been replaced.

I think that's the surprising bit. Most fields with this much money at stake show more architectural divergence. This post is the catalogue of the non-attention swaps and why each one stuck. Attention itself is its own zoo (linear, sliding-window, MLA, GQA, MoE-routed) and gets a separate post. State space models are weirder still and get the post after that. Today: normalisation, the FFN gate, tokenizers, embeddings, and the structural choice that pulled positional encodings out of the parameter budget.

Decoder-only ate the field

Encoder-decoder (T5 (2019), BART (2019)) and encoder-only (BERT (2018)) had the lead in 2018-2020. Then GPT-3 (2020) showed that next-token prediction at scale gives you everything the other two were doing (translation, classification, span infilling) plus open-ended generation. By the time Llama-1 shipped (2023), encoder-only had narrowed to retrieval and embeddings, and encoder-decoder lived mostly in T5 lineages.

Causal LM → seq2seq → MLM → causal LM (with prompts). The forward pass got simpler. Everything else in this post is what filled the space the simplification opened up.

Normalisation: drop the mean, swap the order, normalise the queries

LayerNorm subtracts the mean, divides by std, applies a learned affine. Roughly $6d$ FLOPs per token. RMSNorm (2019) noticed the mean-subtraction step doesn't change the loss curve. Strip it:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \epsilon}} \odot g.$$

Two passes, $\approx 4d$ FLOPs. The argument is that the next linear layer can re-centre its inputs anyway, so the mean term was always redundant. The empirical part is that quality holds across every replication I've seen. So you save the FLOPs for free.
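To make the difference concrete, here is a minimal sketch of both norms in plain Python. The function names, the `eps` default, and the unit gain are my own illustrative choices, not taken from any particular codebase, and the FLOP helper only encodes the rough 6d and 4d estimates quoted above.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by std, apply a learned affine."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: no mean subtraction and no bias, just rescale by the RMS."""
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

def norm_flops_per_token(d, kind):
    """Rough per-token cost, using this post's estimates (assumption)."""
    return {"layernorm": 6 * d, "rmsnorm": 4 * d}[kind]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * len(x), [0.0] * len(x)
y_ln = layer_norm(x, ones, zeros)   # zero mean, unit variance
y_rms = rms_norm(x, ones)           # unit RMS, mean left alone
print(norm_flops_per_token(4096, "layernorm"), norm_flops_per_token(4096, "rmsnorm"))
```

The RMSNorm output keeps whatever mean the input had; the bet is that the next linear layer doesn't care.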

[Interactive figure: LayerNorm vs RMSNorm, per-call cost. LayerNorm at 6·d FLOPs/token, RMSNorm at 4·d. The gap opens linearly with sequence length, and you pay it twice per layer in pre-norm. Readouts at hidden dim d = 4096: LayerNorm 24576 FLOPs/token, RMSNorm 16384 FLOPs/token, savings at max seq 8.4%.]

The other normalisation argument is about order. The original transformer puts the norm after each residual (post-norm: $x_{l+1} = \text{norm}(x_l + f(x_l))$). Xiong et al. (2020) showed that this contracts gradients geometrically with depth, which is why the original paper needed warmup tricks to train at all. Pre-norm ($x_{l+1} = x_l + f(\text{norm}(x_l))$) keeps a clean residual path through depth and trains stably without warmup. Every modern decoder I'm aware of is pre-norm.

[Interactive figure: Gradient norm through depth, pre-norm vs post-norm. Toy linearisation: post-norm contracts gradients by ~(1+α²)^(-1/2) per block, so ‖∇‖ falls off geometrically with depth. Pre-norm keeps the residual path open. This is why post-norm needed warmup tricks to train at all and pre-norm doesn't. Dials: depth L (Llama-3 8B is 32; 70B is 80) and residual scale α (bigger α = stronger contraction in post-norm). Readouts at L = 48, α = 1.00: per-block contraction (post) 0.707, pre-norm ‖∇‖ at L 0.837, post-norm ‖∇‖ at L 5.96e-8.]

The toy is a linearisation, not a real training curve, but the qualitative shape is what the literature shows: post-norm at $L = 48$ has already lost most of its gradient signal. Crank depth up to 128 and the post-norm line touches zero.
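The post-norm side of the toy is simple enough to reproduce by hand. This sketch assumes only the per-block contraction factor (1+α²)^(-1/2) from the toy linearisation; the function name is mine, and it says nothing about real training dynamics or about how the pre-norm curve is computed.

```python
def post_norm_grad_scale(depth, alpha=1.0):
    """Toy gradient scale after `depth` post-norm blocks.

    Each block multiplies the gradient norm by (1 + alpha^2)^(-1/2),
    so the product decays geometrically with depth.
    """
    per_block = (1.0 + alpha ** 2) ** -0.5
    return per_block ** depth

for depth in (32, 48, 80, 128):
    print(depth, f"{post_norm_grad_scale(depth):.3g}")
```

With α = 1 the per-block factor is about 0.707, and 48 blocks give 2^-24, roughly 5.96e-8.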

The third normalisation move is recent. QK-Norm applies an extra RMSNorm to $Q$ and $K$ separately before the attention dot product. The story is empirical: at large scale, $Q \cdot K$ logits drift to magnitudes where softmax saturates, training stalls, and loss curves blow up unpredictably. Normalising each side first keeps the logit distribution well-conditioned. It started showing up in 2020 vision transformer papers, was rediscovered for LLM stability in 2024, and is now standard in Gemma 3, Qwen3, Llama-4, and most frontier open releases. It costs almost nothing and removes a class of training instabilities. I think that's a good trade. Stanford's CS336 lecture walks through QK-Norm at 1:10:25 if you want the video version.
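A small sketch of why the normalisation bounds the logit. It assumes a single head, the usual 1/sqrt(d) scaling, and a learned gain fixed at 1 (real implementations learn a per-dimension gain); the names and the example vectors are hypothetical.

```python
import math

def rms_normalise(v, eps=1e-6):
    """Rescale v to (at most) unit RMS; gain is fixed at 1 for this sketch."""
    inv_rms = 1.0 / math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a * inv_rms for a in v]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = rms_normalise(q), rms_normalise(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately large activations, standing in for drifted Q and K.
q = [50.0, -40.0, 30.0, 20.0]
k = [45.0, -35.0, 25.0, 30.0]
raw = attention_logit(q, k)
normed = attention_logit(q, k, qk_norm=True)
print(raw, normed)
```

After normalisation each vector has length at most sqrt(d), so by Cauchy-Schwarz the scaled logit can't exceed sqrt(d) in magnitude, however far the raw activations drift.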

SwiGLU and the GLU family

The FFN block in GPT-2 is two linears with a pointwise nonlinearity:

$$\text{FFN}(x) = W_{\text{out}} \, \text{GELU}(W_{\text{in}} x).$$

GLU Variants (2020) replaced the pointwise step with a gated unit:

$$\text{SwiGLU}(x) = W_{\text{out}} \, \bigl( \text{swish}(W_1 x) \odot (W_2 x) \bigr),$$

with $\text{swish}(z) = z \cdot \sigma(z)$. The same paper introduced GeGLU (gelu-gated) and ReGLU (relu-gated). PaLM (2022) was the first big training run to ship SwiGLU, and Llama-1 brought it to the first widely used open model. A gated FFN has three weight matrices instead of two, so at the same inner width it costs roughly $1.5\times$ a plain FFN per token in both parameters and compute. To match the non-gated baseline you shrink the inner width to $2/3$ of its original size, which brings both back to parity.
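Here is a minimal sketch of the gated FFN and the parameter bookkeeping behind the 2/3 rule. It works on one token vector with nested lists as matrices and ignores biases; the helper names and d = 768 are assumptions for illustration. Since FFN compute is dominated by the matrix multiplies, FLOPs per token track the same counts.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))      # z * sigmoid(z)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W_out):
    """W_out (swish(W1 x) * (W2 x)), elementwise product on the inner width."""
    gate = [swish(z) for z in matvec(W1, x)]
    value = matvec(W2, x)
    return matvec(W_out, [g * v for g, v in zip(gate, value)])

def ffn_params(d, inner, gated):
    """Weight count without biases: 2 matrices plain, 3 matrices gated."""
    return (3 if gated else 2) * d * inner

d = 768
plain = ffn_params(d, 4 * d, gated=False)          # GPT-2 style, inner = 4d
same_width = ffn_params(d, 4 * d, gated=True)      # gated at the same width
shrunk = ffn_params(d, (8 * d) // 3, gated=True)   # inner shrunk to 2/3
print(plain, same_width, shrunk)
```

At the same inner width the gated version is 1.5x the plain one; at 8d/3 the two counts match exactly whenever d is divisible by 3.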

[Interactive figure: SwiGLU, GeGLU, ReGLU on a 1D slice. Same gate template, three different activations on the gate branch. The family agrees almost everywhere. The activation choice is a small effect; the gate is what matters. Gate template: act(w₁·x) · (w₂·x). SwiGLU activation: swish (z·σ(z)). GeGLU activation: gelu (tanh approx.). ReGLU activation: relu.]

The three curves overlap in most places. SwiGLU and GeGLU are almost indistinguishable on this slice (swish and gelu are similar smooth approximations to relu). ReGLU sits a hair below them with a sharper kink at the gate's zero crossing. The real loss-per-parameter ranking from the paper has SwiGLU ≈ GeGLU > ReGLU, but the gap between any of these and a plain ReLU FFN is much larger than the gap between them. The gate carves shapes a pointwise nonlinearity can't. I think that's the actual win, and the activation choice on the gate branch is mostly a footnote.
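The 1D slice is easy to reproduce. This sketch uses the scalar gate template act(w₁·x) · (w₂·x) with the tanh approximation of gelu; the weights w1 = 1.2 and w2 = 0.8 and the sample points are arbitrary values I picked for illustration.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))

def gelu_tanh(z):
    # Common tanh approximation of GELU.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def relu(z):
    return max(0.0, z)

def glu_slice(x, act, w1=1.2, w2=0.8):
    """Scalar gated unit: the activation acts on the gate branch only."""
    return act(w1 * x) * (w2 * x)

for x in (-2.0, -0.5, 0.5, 2.0, 5.0):
    row = [glu_slice(x, act) for act in (swish, gelu_tanh, relu)]
    print(x, [round(v, 3) for v in row])
```

Away from the gate's zero crossing the three variants converge on the same quadratic w₁·w₂·x²; the differences are concentrated around zero.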

Tokenizer: BPE → SentencePiece → byte-BPE, vocab climbing

The tokenizer story is shorter than the FFN story. BPE (2015) became the default early. SentencePiece (2018) added Unigram and language-agnostic byte handling. Modern OpenAI tokenizers (cl100k_base, o200k_base) and Llama-3's 128k tokenizer are byte-BPE variants tuned for code and multilingual coverage. Vocab grew from 32k (Llama-1, Llama-2) to 128k (Llama-3) to ~200k (GPT-4o).

The interesting open question isn't which tokenizer to pick. It's whether tokenizing should exist at all. Byte-level models (no tokenizer, learn directly on UTF-8 bytes) have looked promising on smaller scales but haven't displaced BPE at frontier scale. The bet is open: tokenizers buy you context-length efficiency (fewer tokens to represent the same text) at the cost of an extra preprocessing step that doesn't generalise across languages cleanly. Raw bytes feel architecturally cleaner. Whether they win depends on whether someone makes byte-level pretraining cost-competitive at frontier scale. I'm watching that one.
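To see the context-length trade in miniature, here is a toy byte-level BPE trainer: start from raw UTF-8 bytes and greedily merge the most frequent adjacent pair. It is a teaching sketch under my own naming, not how any production tokenizer is implemented (those add pre-tokenisation rules, special tokens, and far larger corpora).

```python
from collections import Counter

def bpe_train(data, num_merges):
    """Greedy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    seq = list(data)            # raw bytes, ids 0..255
    merges = []                 # learned merges as ((left, right), new_id)
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break               # nothing repeats, merging buys no compression
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        merges.append(((left, right), next_id))
        next_id += 1
    return seq, merges

def bpe_decode(seq, merges):
    """Expand merged ids back into the original bytes."""
    table = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            left, right = table[tok]
            stack.extend([right, left])   # left is popped first
        else:
            out.append(tok)
    return bytes(out)

data = "the cat sat on the mat, the cat sat on the mat".encode("utf-8")
tokens, merges = bpe_train(data, num_merges=10)
print(len(data), "bytes ->", len(tokens), "tokens")
```

With zero merges this is exactly the byte-level model: one token per byte. Every merge trades a vocabulary entry for a shorter sequence.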

[Interactive figure: Bytes per token vs vocab size. Series: bytes / token, embedding share × 10. Heuristic: under a Zipf-distributed unigram with α≈1, expected token length grows like log V. Bytes/token climbs sublinearly with vocab size; the embedding-table parameter share (×10) climbs roughly linearly until total params start dominating. The sweet spot moves as the model grows. Dials: vocab size in thousands (32 = Llama-1/2; 128 = Llama-3; 200 = GPT-4o) and Zipf α (smaller α = heavier tail = compression saturates earlier). Readouts at V = 128k tokens, d = 4096, P = 8.0B, α = 1.00: bytes / token (heuristic) 4.16, embedding-table share 6.55%.]

The compression curve is sublinear in $V$. Doubling the vocab from 32k to 64k buys a meaningful drop in bytes/token; doubling again to 128k buys less; the next doubling to 256k buys almost nothing. There's a knee, and frontier tokenizers sit near it. The embedding-share curve is the other half of the trade. We'll come back to that in a later blogpost once my experiment results land.

Embedding tying: why "in" and "out" used to be the same job

Press & Wolf (2017) noticed something cheap. The input embedding maps token id → vector. The output projection (sometimes called the unembedding or LM head) maps vector → distribution over token ids. Both are $V \times d$ matrices. Why not share weights? They tested it: perplexity went down and the parameter count dropped.

The trick survived for years. GPT-2, GPT-J, Llama-1, Llama-2, Gemma 1/2 small variants all tied. Then frontier scale broke the assumption. Llama-3 (2024) untied. DeepSeek-V3 untied. Qwen3's small models tie, large models untie. One reading is that untying the matrices buys a performance benefit at the cost of extra parameters, so for large enough models the trade-off is worth it. The more interesting reading is that "in" and "out" aren't quite the same job once you have enough capacity to distinguish them. The input role wants similar tokens to land near each other in vector space; the output role wants the softmax to separate them.

[Interactive figure: Embedding parameters as a share of total model. Series: tied (V·d), untied (2·V·d). Tied (V·d) and untied (2·V·d) embedding-table share, swept over total model size with V=128k and d=4096. At 1B the embedding is a third of the model; at 70B it's a few percent. The economic argument for tying disappears as P grows. Readouts at V = 128k, d = 4096, P range 0.5B – 200B: tied share at max P 0.26%, untied share at max P 0.52%.]

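At V=128k and P=1B, with a width typical of that size (d=2048), you're spending ~50% of the model on the embedding table if you untie. Tying takes that to ~25%. (At d=4096 a single 128k table is already ~0.52B parameters, so an untied pair would not fit in a 1B budget at all.) At d=4096 and P=70B the untied share is ~1.5% and tying saves about 0.75 points. Below ~7B I'd still tie; above 30B the savings stop mattering and the expressivity argument starts to.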
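The share arithmetic fits in a few lines. This sketch assumes the simplest definition, embedding parameters divided by total parameters P, and takes 128k to mean exactly 128,000; the function name and the chosen sizes are mine.

```python
def embedding_share(vocab, d_model, total_params, tied):
    """Fraction of total parameters spent on embedding tables.

    Tied models store one V x d table; untied models store two
    (input embedding plus output projection).
    """
    tables = 1 if tied else 2
    return tables * vocab * d_model / total_params

V = 128_000
for d, P in ((2048, 1e9), (4096, 8e9), (4096, 70e9), (4096, 200e9)):
    tied = embedding_share(V, d, P, tied=True)
    untied = embedding_share(V, d, P, tied=False)
    print(f"d={d} P={P:.0e} tied={tied:.2%} untied={untied:.2%}")
```

The share falls as 1/P for fixed V and d, which is the whole economic argument: the same table that dominates a 1B model is a rounding error at 200B.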

Position: from a parameter table to a function

GPT-2 stored absolute positions as a learned $L \times d$ table. Two problems. The model couldn't generalise to longer contexts than it saw. And the input was absolute, so any "this token is two positions after that one" structure had to be rediscovered by attention from the difference of two arbitrary vectors.

RoFormer (2021) made position a function. Treat each pair of dimensions in $Q$ and $K$ as a 2D subspace, rotate by $m \cdot \theta_i$ at position $m$, and the attention dot product between a query at position $p$ and a key at position $q$ depends only on the relative offset $p - q$. The structural consequence (the one I want for this post) is that the positional table moves out of the parameter budget entirely. RoPE is parameter-free. Its base $b$ is a hyperparameter, and you can dial it after training to push context out without retraining.
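The relative-offset property is easy to check numerically. This is a minimal sketch for a single vector, pairing adjacent dimensions and using frequencies θ_i = b^(-2i/d) with b = 10000 as the conventional default; the function names and example vectors are hypothetical, and real implementations differ in how they pair dimensions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # i = 2 * pair index, so base^(-2j/d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
near = dot(rope(q, 5), rope(k, 3))        # offset 2, early in the sequence
far = dot(rope(q, 105), rope(k, 103))     # offset 2, a hundred tokens later
print(near, far)
```

Both logits agree to floating-point precision because only the offset survives the dot product. Nothing in `rope` is learned; `base` is the only knob.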

GPT-2 → RoPE: position went from "another lookup table the optimiser has to fill" to "a deterministic geometric transform applied at every layer's $Q$, $K$". The deep-dive (NTK-aware scaling, YaRN, NoPE, ALiBi) might live in a future post if I get time to synthesize it. For this post the structural framing is the point: in the modern stack you're not training a position embedding at all.

A modern block in one list

Here's the list. Every modern open decoder layer looks roughly like this:

  1. Save the residual, apply RMSNorm (pre-norm)
  2. Attention (deferred to the next post; usually with QK-Norm and RoPE applied to $Q$, $K$)
  3. Add the residual
  4. Save the residual, apply RMSNorm
  5. SwiGLU FFN (two parallel linears into the inner width, swish gate, output projection)
  6. Add the residual
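The six steps above can be sketched as one function. This version handles a single token vector, takes attention as an opaque callable (it is deferred to the next post, so here it is just a placeholder), and uses nested lists as matrices; the parameter names in `p` are hypothetical.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swish(z):
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W1, W2, W_out):
    hidden = [swish(a) * b for a, b in zip(matvec(W1, x), matvec(W2, x))]
    return matvec(W_out, hidden)

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def decoder_block(x, attention, p):
    """Pre-norm block: x + attn(norm(x)), then x + ffn(norm(x))."""
    # Steps 1-3: normalise a copy, run attention, add back onto the residual.
    x = add(x, attention(rms_norm(x, p["attn_norm_gain"])))
    # Steps 4-6: same pattern around the SwiGLU FFN.
    h = rms_norm(x, p["ffn_norm_gain"])
    return add(x, swiglu_ffn(h, p["W1"], p["W2"], p["W_out"]))

p = {
    "attn_norm_gain": [1.0, 1.0],
    "ffn_norm_gain": [1.0, 1.0],
    "W1": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # d=2 -> inner=3
    "W2": [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]],
    "W_out": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # inner=3 -> d=2
}
no_attention = lambda h: [0.0] * len(h)           # placeholder for the real thing
print(decoder_block([1.0, -2.0], no_attention, p))
```

Note that the norm only ever touches the branch input. The residual stream itself is never normalised inside the block, which is the "clean residual path" from the pre-norm argument.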

Outside the block: byte-BPE tokenizer at ~128k vocab, untied embeddings if the model is large, no positional table.

[Interactive figure: Modern block vs vanilla GPT-2 block, quality per FLOP. Toy composition. Each swap contributes a quality lift and a FLOP delta; the modern block's Q/F ratio separates from the baseline as the dial moves to 1 (full stack). The third dial here is QK-Norm, a closer match to what the late-2024 stack actually ships. Readouts (Δquality / ΔFLOPs): SwiGLU +4.0% / +3.0%, RMSNorm +0.5% / -1.0%, QK-Norm +1.5% / +0.4%, composite Q/F edge +5.2%.]

Recent launches as data points

Two years of open releases, grouped by which architectural choices they ship. The table is wider than a phone screen by design. Scroll horizontally if you're on mobile.

| Model | Year | Norm | FFN | Tokenizer | Position | Embedding |
|---|---|---|---|---|---|---|
| GPT-2 | 2019 | LayerNorm (post→pre) | GELU | 50k BPE | learned absolute | tied |
| GPT-3 | 2020 | LayerNorm (pre) | GELU | 50k BPE | learned absolute | tied |
| PaLM | 2022 | RMSNorm | SwiGLU | 256k SentencePiece | RoPE | untied |
| Llama-1 | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Llama-2 | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Mistral 7B | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Llama-3 | 2024 | RMSNorm | SwiGLU | 128k byte-BPE | RoPE (high base) | untied |
| Phi-4 | 2024 | RMSNorm | SwiGLU | tiktoken o200k | RoPE | untied |
| DeepSeek-V3 | 2024 | RMSNorm + QK-Norm | SwiGLU (MoE) | 128k byte-BPE | RoPE (decoupled) | untied |
| Gemma 3 | 2025 | RMSNorm + QK-Norm | GeGLU | 256k SentencePiece | RoPE | tied (small) / untied (large) |
| Llama-4 | 2025 | RMSNorm + QK-Norm | SwiGLU (MoE) | 128k byte-BPE | RoPE (high base) | untied |
| Qwen3 | 2025 | RMSNorm + QK-Norm | SwiGLU | 152k byte-BPE | RoPE | tied (≤4B) / untied (≥7B) |

The non-attention columns line up fast. Norm: LayerNorm → RMSNorm → RMSNorm + QK-Norm. FFN: GELU → SwiGLU (with GeGLU as a near-equivalent variant). Tokenizer: 32k SentencePiece → 128–256k byte-BPE. Position: learned absolute → RoPE (with rising bases for long context). Embedding: tied → untied past ~7B.

Attention, the column this table leaves out, is the one still in flux. Standard MHA → GQA → MQA → MLA → sliding-window → mixture-of-experts routing → linear attention. That's the topic of the next post.

What's left for the next two posts

Attention variants: GQA (2023), MLA in DeepSeek-V2 (2024), sliding-window in Mistral (2023) and Gemma 3, MoE routing as the new dimension every frontier release picks differently. The same-block story falls apart here. The next post is about why it does.

State space models: Mamba (2023) and the SSM-attention hybrids that are showing up in 2025 releases. I think they're the most credible attention-replacement out there, and they're also the most ergonomically different. Worth their own post.

Check yourself

A 1B-parameter model with V=128k vocab and d=2048 hidden dim is using tied embeddings. You're considering untying. The non-embedding parameters total ~700M. What does the trade look like in added parameters and embedding share?

It's not just frontier scale, either. Keller Jordan's modded-nanogpt is a hand-tuned GPT-2 speedrun where the goal is just to hit a target validation loss as fast as possible on a small budget. It ships roughly the same block, top to bottom. I find it interesting that a 124M-parameter speedrun repo and a trillion-parameter MoE from DeepSeek, Meta, or Kimi ended up in the same place.

This is amazing in my opinion. Multiple labs, domains, sizes, continents, and they all land on the same block. It probably means we're in a local minimum that's hard to leave (which is its own kind of warning), and I think it also means that if you're starting a project from scratch you should pick this block as your default and only deviate where you have a measurement that justifies it.
