The modern transformer block (everything except attention)
Open the code for Llama, Qwen, DeepSeek, or Gemma and you'll find nearly identical design decisions. Here is what has changed since the 2017 paper in normalisation, gating, tokenizers, embeddings, and position, and why each swap stuck.
I keep opening modelling code from a different lab and finding the same block. Llama, Qwen, DeepSeek, Gemma, Phi. Different teams, different stacks, different post-training recipes, but the forward pass is mostly the same. Reread Attention is All You Need (2017) now and you'll find that almost everything outside the attention call has been replaced.
I think that's the surprising bit. In most fields with this much money at stake, architectures diverge more. This post is the catalogue of the non-attention swaps and why each one stuck. Attention itself is its own zoo (linear, sliding-window, MLA, GQA, MoE-routed) and gets a separate post. State space models are weirder still and get the post after that. Today: normalisation, the FFN gate, tokenizers, embeddings, and the structural choice that pulled positional encodings out of the parameter budget.
Decoder-only ate the field
Encoder-decoder (T5 (2019), BART (2019)) and encoder-only (BERT (2018)) had the lead in 2018-2020. Then GPT-3 (2020) showed that next-token prediction at scale gives you everything the other two were doing (translation, classification, span infilling) plus open-ended generation. By the time Llama-1 shipped (2023), encoder-only had narrowed to retrieval and embeddings, and encoder-decoder lived mostly in T5 lineages.
Causal LM → seq2seq → MLM → causal LM (with prompts). The forward pass got simpler. Everything else in this post is what filled the space the simplification opened up.
Normalisation: drop the mean, swap the order, normalise the queries
LayerNorm subtracts the mean, divides by the standard deviation, and applies a learned affine, at a cost of $O(d)$ FLOPs per token. RMSNorm (2019) noticed the mean-subtraction step doesn't change the loss curve. Strip it:
$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot g$$
Two passes over the vector and fewer FLOPs per token. The argument is that the next linear layer can re-centre its inputs anyway, so the mean term was always redundant. The empirical part is that quality holds across every replication I've seen. So you save the FLOPs for free.
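To make the "drop the mean" step concrete, here is a minimal pure-Python sketch of both norms on a single vector. The function names and the test vectors are mine, and a real implementation would run on tensors with learned gains rather than Python lists.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the std, apply a learned affine."""
    d = len(x)
    gain = gain or [1.0] * d
    bias = bias or [0.0] * d
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: no mean subtraction and no bias, just divide by the root mean square."""
    d = len(x)
    gain = gain or [1.0] * d
    mean_sq = sum(v * v for v in x) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [0.5, -1.5, 2.0, -1.0]   # already zero-mean
y = [1.5, -0.5, 3.0, 0.0]    # the same vector shifted by +1

print("LayerNorm(x):", layer_norm(x))
print("RMSNorm(x):  ", rms_norm(x))
print("LayerNorm(y):", layer_norm(y))
print("RMSNorm(y):  ", rms_norm(y))
```

On the zero-mean vector the two norms agree. On the shifted vector LayerNorm removes the offset and RMSNorm keeps it, which is exactly the part the next linear layer is assumed to absorb.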
[Interactive figure: LayerNorm vs RMSNorm, per-call cost]
The other normalisation argument is about order. The original transformer puts the norm after each residual add (post-norm: $x_{l+1} = \mathrm{Norm}(x_l + F(x_l))$). Xiong et al. (2020) showed that this contracts gradients geometrically with depth, which is why the original paper needed warmup tricks to train at all. Pre-norm ($x_{l+1} = x_l + F(\mathrm{Norm}(x_l))$) keeps a clean residual path through depth and trains stably without warmup. Every modern decoder I'm aware of is pre-norm.
[Interactive figure: gradient norm through depth, pre-norm vs post-norm. Controls: depth (Llama-3 8B is 32; 70B is 80) and α (bigger α = stronger contraction in post-norm).]
The toy is a linearisation, not a real training curve, but the qualitative shape is what the literature shows: post-norm at the depths real models use has already lost most of its gradient signal. Crank depth up to 128 and the post-norm line touches zero.
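The difference between the two orderings is easiest to see in the wiring. This is a small sketch, not the toy behind the figure: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the norm has no learned gain.

```python
import math

def rms_norm(x, eps=1e-6):
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: a small fixed perturbation."""
    return [0.1 * v for v in reversed(x)]

def post_norm_step(x, sublayer):
    # Norm(x + F(x)): the norm sits on the residual stream itself.
    return rms_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    # x + F(Norm(x)): the norm sits on the branch only.
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

x = [3.0, -4.0, 12.0, 0.5]
branch = toy_sublayer(rms_norm(x))
pre = pre_norm_step(x, toy_sublayer)
post = post_norm_step(x, toy_sublayer)

# In pre-norm, output minus input is exactly the branch: x passes through untouched.
pre_delta = [p - a for p, a in zip(pre, x)]
# In post-norm, the stream is rescaled to unit RMS at every layer.
post_rms = math.sqrt(sum(v * v for v in post) / len(post))

print("pre-norm delta:", pre_delta)
print("branch output: ", branch)
print("post-norm RMS: ", post_rms)
```

The identity path in `pre_norm_step` is what lets gradients reach early layers without passing through a normalisation Jacobian at every step.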
The third normalisation move is recent. QK-Norm applies an extra RMSNorm to $Q$ and $K$ separately before the attention dot product. The story is empirical: at large scale, logits drift to magnitudes where softmax saturates, training stalls, and loss curves blow up unpredictably. Normalising each side first keeps the logit distribution well-conditioned. It started showing up in 2020 vision transformer papers, was rediscovered for LLM stability in 2024, and is now standard in Llama-4, Gemma 3, Qwen3, and most frontier open releases. It costs almost nothing and removes a class of training instabilities. I think that's a good trade. Stanford's CS336 lecture walks through QK-Norm at 1:10:25 if you want the video version.
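A quick way to see why normalising both sides bounds the logits: this sketch computes one scaled dot-product logit with and without QK-Norm, then inflates the activations by a made-up factor. The vectors, the `scale` value, and the gain-free norm are all assumptions for illustration. Real implementations add a learned gain per head dimension.

```python
import math

def rms_norm(x, eps=1e-6):
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv for v in x]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for a single query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [0.3, -1.2, 0.8, 2.0]
k = [1.0, -0.7, 0.2, 1.5]
scale = 50.0                         # hypothetical drift in activation magnitude
big_q = [scale * v for v in q]
big_k = [scale * v for v in k]

raw_small = attention_logit(q, k)
raw_big = attention_logit(big_q, big_k)
normed_small = attention_logit(q, k, qk_norm=True)
normed_big = attention_logit(big_q, big_k, qk_norm=True)

print("raw logits:   ", raw_small, raw_big)
print("normed logits:", normed_small, normed_big)
```

Without the norm, the logit grows with the square of the activation scale. With it, the logit cannot exceed $\sqrt{d}$ in magnitude however large the inputs get, so the softmax stays out of its saturated region unless a learned gain pushes it there.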
SwiGLU and the GLU family
The FFN block in GPT-2 is two linears with a pointwise nonlinearity (biases omitted):
$$\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x)$$
GLU Variants (2020) replaced the pointwise step with a gated unit:
$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2\big(\mathrm{swish}(W_1 x) \odot W_3 x\big)$$
with $\mathrm{swish}(z) = z\,\sigma(z)$. The same paper introduced GeGLU (gelu-gated) and ReGLU (relu-gated). PaLM (2022) was the first big training run to ship SwiGLU, and Llama-1 brought it to the first widely used open model. To match parameter count with the non-gated baseline you shrink the inner width by $2/3$, so the compute is roughly that of a plain FFN per token.
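Here is a minimal sketch of the gated FFN on plain Python lists, plus the parameter-matching arithmetic. The weight names (`W_gate`, `W_up`, `W_down`) and the tiny matrices are mine, chosen for readability, and biases are left out.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))          # z * sigmoid(z)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Two parallel projections into the inner width, swish on one, then project out."""
    gate = [swish(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def ffn_params(d_model, d_inner, gated):
    """Weight count without biases: 2 matrices for a plain FFN, 3 for a gated one."""
    return (3 if gated else 2) * d_model * d_inner

# Tiny example: d_model = 2, inner width = 3.
x = [1.0, -2.0]
W_gate = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W_down = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.6]]
out = swiglu_ffn(x, W_gate, W_up, W_down)
print("SwiGLU output:", out)

# Parameter matching at GPT-2-small width: shrink the inner width by 2/3.
d_model = 768
plain_inner = 4 * d_model
gated_inner = plain_inner * 2 // 3
print(ffn_params(d_model, plain_inner, gated=False),
      ffn_params(d_model, gated_inner, gated=True))
```

The third matrix is what the 2/3 factor pays for: three matrices at two-thirds the width cost the same as two at full width.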
[Interactive figure: SwiGLU, GeGLU, ReGLU on a 1D slice]
The three curves overlap in most places. SwiGLU and GeGLU are almost indistinguishable on this slice (swish and gelu are similar smooth approximations to relu). ReGLU sits a hair below them with a sharper kink at the gate's zero crossing. The real loss-per-parameter ranking from the paper has SwiGLU ≈ GeGLU > ReGLU, but the gap between any of these and a plain ReLU FFN is much larger than the gap between them. The gate carves shapes a pointwise nonlinearity can't. I think that's the actual win, and the activation choice on the gate branch is mostly a footnote.
Tokenizer: BPE → SentencePiece → byte-BPE, vocab climbing
The tokenizer story is shorter than the FFN story. BPE (2015) became the default early. SentencePiece (2018) added Unigram and language-agnostic byte handling. OpenAI's tiktoken encodings (cl100k_base, o200k_base) and Llama-3's 128k tokenizer are byte-BPE variants tuned for code and multilingual coverage. Vocab grew from 32k (Llama-1, Llama-2) to 128k (Llama-3) to ~200k (GPT-4o).
The interesting open question isn't which tokenizer to pick. It's whether tokenizing should exist at all. Byte-level models (no tokenizer, learn directly on UTF-8 bytes) have looked promising on smaller scales but haven't displaced BPE at frontier scale. The bet is open: tokenizers buy you context-length efficiency (fewer tokens to represent the same text) at the cost of an extra preprocessing step that doesn't generalise across languages cleanly. Raw bytes feel architecturally cleaner. Whether they win depends on whether someone makes byte-level pretraining cost-competitive at frontier scale. I'm watching that one.
[Interactive figure: bytes per token vs vocab size. Controls: vocab size in thousands (32 = Llama-1/2; 128 = Llama-3; 200 = GPT-4o) and α (smaller α = heavier tail = compression saturates earlier).]
The compression curve is sublinear in $V$. Doubling the vocab from 32k to 64k buys a meaningful drop in bytes/token; doubling again to 128k buys less; the next doubling to 256k buys almost nothing. There's a knee, and frontier tokenizers sit near it. The embedding-share curve is the other half of the trade. We'll come back to that in a later blogpost once my experiment results land.
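If you want to poke at the vocab-versus-compression trade yourself, a greedy byte-level BPE fits in a few lines. This is a toy sketch, not any production tokenizer: the function name and the corpus are made up, there is no pre-tokenization, and a repetitive pangram compresses far better than real text, so only the direction of the curve carries over.

```python
from collections import Counter

def train_byte_bpe(data: bytes, num_merges: int):
    """Greedy byte-level BPE. Returns (token_count, merges_actually_done)."""
    seq = list(data)                      # ids 0..255 are the raw bytes
    next_id = 256
    done = 0
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                     # nothing left worth merging
            break
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(next_id)    # replace the pair with the new token id
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
        done += 1
    return len(seq), done

corpus = ("the quick brown fox jumps over the lazy dog. " * 20).encode("utf-8")
results = []
for budget in (0, 4, 8, 16, 32):
    n_tokens, done = train_byte_bpe(corpus, budget)
    results.append((256 + done, len(corpus) / n_tokens))

for vocab, bpt in results:
    print(f"vocab {vocab:4d}: {bpt:.2f} bytes/token")
```

Every merge adds one vocabulary entry and can only shorten the sequence, so bytes/token never goes down as the vocab grows. How quickly the gains flatten depends on the corpus.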
Embedding tying: why "in" and "out" used to be the same job
Press & Wolf (2017) noticed something cheap. The input embedding maps token id → vector. The output projection (sometimes called the unembedding or LM head) maps vector → distribution over token ids. Both are $V \times d$ matrices. Why not share weights? They tested it, and perplexity improved while the parameter count dropped.
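In code, tying is just using one matrix for both jobs. A minimal sketch with a made-up three-token vocabulary (all names and values here are illustrative):

```python
def embed(E, token_id):
    """Input role: look up one row of the V x d table."""
    return E[token_id]

def logits(W_out, h):
    """Output role: one dot product per vocabulary row."""
    return [sum(w * v for w, v in zip(row, h)) for row in W_out]

# V = 3 tokens, d = 2 dimensions.
E = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

W_tied = E                               # same object: one set of parameters
W_untied = [row[:] for row in E]         # separate copy, free to diverge in training

h = embed(E, 2)
print("tied logits:  ", logits(W_tied, h))
print("untied logits:", logits(W_untied, h))

# Simulate a gradient update that touches only the embedding table.
E[0][0] = 2.0
print("after update, tied row 0:  ", W_tied[0])
print("after update, untied row 0:", W_untied[0])

V, d = len(E), len(E[0])
print("tied params:", V * d, "untied params:", 2 * V * d)
```

With tying, any update to the input table is also an update to the output projection. Untying doubles the embedding parameter count and lets the two roles learn different geometry.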
The trick survived for years. GPT-2, GPT-J, Llama-1, Llama-2, Gemma 1/2 small variants all tied. Then frontier scale broke the assumption. Llama-3 (2024) untied. DeepSeek-V3 untied. Qwen3's small models tie, large models untie. One reading is that untying the matrices buys a performance benefit at the cost of extra parameters, so past a certain model size the trade is worth it. The more interesting reading is that "in" and "out" aren't quite the same job once you have enough capacity to distinguish them. The input role wants similar tokens to land near each other in vector space; the output role wants the softmax to separate them.
[Interactive figure: embedding parameters as a share of total model]
At V=128k, d=4096, and P=1B non-embedding parameters, you're spending ~50% of the model on the embedding tables if you untie. Tying takes that to ~34%. At P=70B the untied share is ~1.5% and tying takes it to ~0.7%. Below ~7B I'd still tie; above 30B the savings stop mattering and the expressivity argument starts to.
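The share arithmetic is short enough to check by hand. This sketch assumes P counts the non-embedding parameters, which is the reading under which the ~50% and ~1.5% untied figures come out. Under the same reading the tied share at P=1B lands nearer a third than a quarter. The function name is mine.

```python
def embedding_share(vocab, d_model, non_embedding_params, tied):
    """Fraction of all parameters that sit in the embedding table(s)."""
    table = vocab * d_model
    embedding = table if tied else 2 * table   # untied = input table + LM head
    return embedding / (non_embedding_params + embedding)

V, d = 128_000, 4096
for P in (1e9, 7e9, 30e9, 70e9):
    untied = embedding_share(V, d, P, tied=False)
    tied = embedding_share(V, d, P, tied=True)
    print(f"P = {P / 1e9:4.0f}B  untied {untied:6.2%}  tied {tied:6.2%}")
```

The table itself is a fixed ~0.5B parameters at this V and d, so its share falls off roughly as 1/P once the rest of the model dominates.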
Position: from a parameter table to a function
GPT-2 stored absolute positions as a learned table. Two problems. The model couldn't generalise to longer contexts than it saw. And the input was absolute, so any "this token is two positions after that one" structure had to be rediscovered by attention from the difference of two arbitrary vectors.
RoFormer (2021) made position a function. Treat each pair of dimensions in $q$ and $k$ as a 2D subspace, rotate it by the angle $m\theta_i$ at position $m$, and the attention dot product between positions $m$ and $n$ depends only on the relative offset $m - n$. The structural consequence (the one I want for this post) is that the positional table moves out of the parameter budget entirely. RoPE is parameter-free. Its base is a hyperparameter, and you can dial it after training to push context out without retraining.
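The relative-offset property is easy to verify numerically. This is a minimal sketch that rotates adjacent dimension pairs, as in the RoFormer formulation. Some implementations pair dimension i with i + d/2 instead, which gives the same property with a different layout. The vectors and positions are arbitrary.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # one frequency per 2D pair
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def score(q, k, m, n):
    """Attention dot product between a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

near = score(q, k, 5, 2)        # offset 3
far = score(q, k, 105, 102)     # same offset, shifted by 100 positions
other = score(q, k, 5, 3)       # different offset

print("offset 3 at positions (5, 2):    ", near)
print("offset 3 at positions (105, 102):", far)
print("offset 2 at positions (5, 3):    ", other)
```

Nothing in `rope` is learned: the only knob is `base`, which is why changing it after training is even possible.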
GPT-2 → RoPE: position went from "another lookup table the optimiser has to fill" to "a deterministic geometric transform applied at every layer's $Q$, $K$". The deep-dive (NTK-aware scaling, YaRN, NoPE, ALiBi) might get its own post if I find the time to synthesize it. For this post the structural framing is the point: in the modern stack you're not training a position embedding at all.
A modern block in one list
Here's the list. Every modern open decoder layer looks roughly like this:
- Residual + RMSNorm (pre-norm)
- Attention (deferred to the next post; usually with QK-Norm and RoPE applied to $Q$, $K$)
- Residual
- Residual + RMSNorm
- SwiGLU FFN (two parallel linears into the inner width, swish gate, output projection)
- Residual
Outside the block: byte-BPE tokenizer at ~128k vocab, untied embeddings if the model is large, no positional table.
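The six-step list collapses to two lines of wiring. Here is a sketch for a single token vector, with attention and the FFN passed in as callables. The toy sublayers are placeholders, not real attention (which mixes across positions), and a real block would give each norm its own learned gain.

```python
import math

def rms_norm(x, eps=1e-6):
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def modern_block(x, attention, ffn):
    """One pre-norm decoder block: norm -> sublayer -> residual add, twice."""
    x = add(x, attention(rms_norm(x)))   # list items 1-3
    x = add(x, ffn(rms_norm(x)))         # list items 4-6
    return x

# Hypothetical sublayers, only here to make the sketch runnable.
def toy_attention(h):
    return [0.5 * v for v in h]

def toy_ffn(h):
    return [0.1 * v * v for v in h]

def zero(h):
    return [0.0] * len(h)

x = [1.0, -2.0, 0.5, 3.0]
print("block output:    ", modern_block(x, toy_attention, toy_ffn))
print("with zero layers:", modern_block(x, zero, zero))
```

With both sublayers zeroed the block is the identity, which is the pre-norm residual path doing its job.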
[Interactive figure: modern block vs vanilla GPT-2 block, quality per FLOP]
Recent launches as data points
Two years of open releases, grouped by which architectural choices they ship. The table is wider than a phone screen by design. Scroll horizontally if you're on mobile.
| Model | Year | Norm | FFN | Tokenizer | Position | Embedding |
|---|---|---|---|---|---|---|
| GPT-2 | 2019 | LayerNorm (post→pre) | GELU | 50k BPE | learned absolute | tied |
| GPT-3 | 2020 | LayerNorm (pre) | GELU | 50k BPE | learned absolute | tied |
| PaLM | 2022 | RMSNorm | SwiGLU | 256k SentencePiece | RoPE | untied |
| Llama-1 | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Llama-2 | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Mistral 7B | 2023 | RMSNorm | SwiGLU | 32k SentencePiece | RoPE | tied |
| Llama-3 | 2024 | RMSNorm | SwiGLU | 128k byte-BPE | RoPE (high base) | untied |
| Phi-4 | 2024 | RMSNorm | SwiGLU | tiktoken o200k | RoPE | untied |
| DeepSeek-V3 | 2024 | RMSNorm + QK-Norm | SwiGLU (MoE) | 128k byte-BPE | RoPE (decoupled) | untied |
| Gemma 3 | 2025 | RMSNorm + QK-Norm | GeGLU | 256k SentencePiece | RoPE | tied (small) / untied (large) |
| Llama-4 | 2025 | RMSNorm + QK-Norm | SwiGLU (MoE) | 128k byte-BPE | RoPE (high base) | untied |
| Qwen3 | 2025 | RMSNorm + QK-Norm | SwiGLU | 152k byte-BPE | RoPE | tied (≤4B) / untied (≥8B) |
The non-attention columns line up fast. Norm: LayerNorm → RMSNorm → RMSNorm + QK-Norm. FFN: GELU → SwiGLU (with GeGLU as a near-equivalent variant). Tokenizer: 32k SentencePiece → 128–256k byte-BPE. Position: learned absolute → RoPE (with rising bases for long context). Embedding: tied → untied past ~7B.
The attention column is the one still in flux. Standard MHA → GQA → MQA → MLA → sliding-window → mixture-of-experts routing → linear attention. That's the topic of the next post.
What's left for the next two posts
Attention variants: GQA (2023), MLA in DeepSeek-V2 (2024), sliding-window in Mistral (2023) and Gemma 3, MoE routing as the new dimension every frontier release picks differently. The same-block story falls apart here. The next post is about why it does.
State space models: Mamba (2023) and the SSM-attention hybrids that are showing up in 2025 releases. I think they're the most credible attention-replacement out there, and they're also the most ergonomically different. Worth their own post.
Check yourself
A 1B-parameter model with V=128k vocab and d=2048 hidden dim is using tied embeddings. You're considering untying. The non-embedding parameters total ~700M. How would you describe the trade?
It's not just frontier scale, either. Keller Jordan's modded-nanogpt is a hand-tuned GPT-2 speedrun where the goal is just to hit a target validation loss as fast as possible on a small budget. It ships roughly the same block, top to bottom. I find it interesting that a 124M-parameter speedrun repo and a trillion-parameter MoE from DeepSeek, Meta, or Kimi ended up in the same place.
This is amazing in my opinion. Multiple labs, domains, sizes, continents, and they all land on the same block. It probably means we're in a local minimum that's hard to leave (which is its own kind of warning), and I think it also means that if you're starting a project from scratch you should pick this block as your default and only deviate where you have a measurement that justifies it.