Attention variants beyond softmax: a 2026 map

A year ago I'd have told you softmax attention was non-negotiable. Swap GQA in for cache, sprinkle some YaRN on top, ship. The 2025 wave of open models broke that assumption in three different ways. This post is the map I went looking for and couldn't find: linear attention, sparse attention, MLA, and the hybrid stacks, all on the same axes. Fair warning about what this is: I'm reading papers and model cards here, not running internal eval suites, and every plot below is a toy model built to show shapes rather than a measurement.

The cache problem is what forces all of it. At 1M tokens, full MHA on an illustrative 48-layer / 32-head / 128-dim stack with a 16-bit cache reads about 786 GB per generated token just to look at history. That's bigger than the model. Everything in this post is a different answer to "fit the cache budget without tanking quality".
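The arithmetic behind that figure, as a quick sketch. The 2-byte (16-bit) element size is my assumption; the layer, head, and dimension counts are the illustrative shape above, with one K and one V vector per head per layer.

$$
\underbrace{2}_{K,\,V} \times 48 \times 32 \times 128 \times 2\ \text{bytes} = 786{,}432\ \text{bytes per token},
\qquad
786{,}432 \times 10^{6}\ \text{tokens} \approx 786\ \text{GB}.
$$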

The four moves

You can keep full softmax and throw tokens away (sparse). You can keep softmax and compress what you cache (MLA). You can drop softmax and run a constant-size recurrent state (linear). Or you can interleave layers of two kinds (hybrid). Every shipping open model in 2026 is some combination of these four.

One caveat before the table. GPT-5, Claude Sonnet 4.5+, and Gemini don't publish their attention internals, so everything here is what's verifiable from open weights and papers. The closed frontier might be doing any of the below, or something none of us have thought of.

The attribution table

Model	Year	Params (total / active)	Attention	Pattern	Why this choice
DeepSeek-V2	2024	236B / 21B	MLA	every layer	introduced MLA; ~10× cache cut over MHA
DeepSeek-V3	Dec 2024	671B / 37B	MLA + DeepSeekMoE	every layer	scaled the V2 recipe up
DeepSeek-V3.2-Exp	Sep 2025	671B / 37B	MLA + DSA	every layer	per-token top-k selection; O(L²) → O(Lk). Deployed.
MiniMax-01	Jan 2025	456B / 45.9B	linear (lightning) + softmax	hybrid, MoE	1M train / 4M inference context
MiniMax-M1	Jun 2025	n/a	linear (lightning) + softmax	hybrid	reasoning model on the same kernel mix
Llama-4 (Scout/Maverick)	Apr 2025	MoE	softmax + iRoPE (3:1)	hybrid (positional)	the hybrid here is 3 RoPE-chunked layers per 1 NoPE; not a kernel swap
Kimi K2	Jul 2025	1.04T / 32B	MLA	every layer	384 experts, 64 attention heads
Kimi K2.5 / K2.6	late 2025 / 2026	1T / 32B	MLA	every layer	256K context
Kimi Linear	Oct 2025	n/a	KDA + MLA	3:1 hybrid	75% linear, 25% softmax
Qwen3-Next-80B-A3B	Sep 2025	80B / 3B	Gated DeltaNet + GQA	3:1 hybrid	same ratio as Kimi Linear, independent choice. 256K-1M context
Jamba-1.5-Large	Aug 2024	398B / 94B	Mamba + softmax + MoE	7:1 hybrid	7 Mamba layers per softmax layer; 256K context
NSA	Feb 2025	n/a	Native Sparse Attention	compress + select + sliding	natively trainable. Research paper, not a shipped model.

Six families. Three answers.

MLA: compress what you keep

Cleanest one to introduce first because it's still a softmax kernel, just over a smaller cached representation. DeepSeek-V2 / V3 (2024) caches a low-rank latent of dimension $d_c$ (typically ~512) plus a tiny decoupled-RoPE channel. That's roughly a 10–15× reduction over MHA, and it's already enough to make 100K-token serving comfortable on a single GPU. Every Kimi model uses MLA. So does V3.2.
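A sanity check on that ratio, counting cached elements per token per layer for the illustrative 32-head / 128-dim shape. $d_c = 512$ is from the text; the 64-wide RoPE channel $d_r$ is my assumption (the text only calls it tiny).

$$
\frac{\text{MHA}}{\text{MLA}} = \frac{2 \cdot h \cdot d}{d_c + d_r} = \frac{2 \cdot 32 \cdot 128}{512 + 64} = \frac{8192}{576} \approx 14.2
$$

That lands inside the 10–15× range; a different head count or latent rank moves it.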

Linear (lightning) attention

The lineage runs: linear attention (2020) → RetNet (2023) → GLA (2023) → DeltaNet (2024) → Gated DeltaNet (2024) → KDA / Kimi Linear (2025). MiniMax ships its own variant under the brand name lightning attention; it's the same family.

The one-line summary: replace softmax with a kernel feature map, keep a running outer-product state, get O(T) compute and a cache that's constant in T. Two things had to land before it actually worked. A delta-rule update (treat the state as an associative memory; write the error, not the value), and per-channel forget gating (each dimension picks its own retention timescale). The deep math involves chunked-recurrent kernels and a Diagonal-Plus-Low-Rank state-transition matrix; here, the only thing you need is "cache stops growing".
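To make "write the error, not the value" concrete, here is a minimal single-head sketch of one recurrent step. It is a simplification I wrote for illustration, not the KDA or Gated DeltaNet kernel: no feature map, no chunking, and the names `gated_delta_step`, `beta`, and `decay` are mine.

```python
import numpy as np

def gated_delta_step(S, q, k, v, beta, decay):
    """One step of a simplified gated delta-rule memory (illustrative only).

    S:     (d_k, d_v) state, the only thing carried between tokens
    q, k:  (d_k,) query and key
    v:     (d_v,) value
    beta:  scalar write strength in [0, 1]
    decay: (d_k,) per-channel retention in [0, 1]; 1.0 means never forget
    """
    S = decay[:, None] * S                  # per-channel forget gate
    pred = S.T @ k                          # what the memory returns for k right now
    S = S + beta * np.outer(k, v - pred)    # delta rule: write the error, not the value
    out = S.T @ q                           # read with the query
    return S, out

# Usage: write v1 under a unit-norm key, then overwrite it with v2.
k = np.array([1.0, 0.0, 0.0, 0.0])
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])
no_decay = np.ones(4)

S = np.zeros((4, 3))
S, out1 = gated_delta_step(S, k, k, v1, beta=1.0, decay=no_decay)
S, out2 = gated_delta_step(S, k, k, v2, beta=1.0, decay=no_decay)
```

After the second step the memory returns `v2` for that key rather than `v1 + v2`, which is what a plain additive outer-product update would give. The state stays `(d_k, d_v)` however many tokens pass through.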

Sparse attention

Three designs in this family with very different vintages but the same core move: don't read every key.

Sliding window / chunked. The trivial baseline. Llama-4's RoPE layers cap their window at 8K; GPT-OSS-style chunked attention does the same. Cheap, but it's recall-blind.
NSA (Feb 2025). Hierarchical, three branches: compressed coarse blocks (a summary state), selected fine blocks (the most relevant ones), and a sliding window. Trained natively (not a finetune onto a softmax-pretrained model). Research paper, not yet in a shipped frontier model.
DSA, in DeepSeek-V3.2-Exp (Sep 2025). A small FP8 "lightning indexer" scores every historical key per query, then attention runs only over the top-k. Core attention drops from O(L²) to O(L·k). Shipped.

The pattern: NSA is the design study, DSA is the deployment. Both flatten the cache curve once T crosses the cap.
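The top-k move can be sketched for a single query in a few lines. This is not DeepSeek's kernel: the indexer is replaced by a plain array of scores passed in by the caller, and `topk_sparse_attention` is a hypothetical name.

```python
import numpy as np

def topk_sparse_attention(q, K, V, index_scores, k):
    """Softmax attention for one query over only the k best-scored keys.

    q: (d,)   K: (T, d)   V: (T, d_v)
    index_scores: (T,) cheap relevance scores from a separate indexer
    k: number of keys to keep (assumed >= 1)
    """
    T = K.shape[0]
    k = min(k, T)
    # Indices of the k largest scores, in no particular order.
    keep = np.argpartition(index_scores, T - k)[T - k:]
    logits = K[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V[keep], keep

# Usage with random data; the exact dot product stands in for the indexer.
rng = np.random.default_rng(0)
T, d = 16, 8
q = rng.normal(size=d)
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, 4))
scores = K @ q
out, keep = topk_sparse_attention(q, K, V, scores, k=4)
```

Only `k` rows of `K` and `V` are read per step, which is where the O(L·k) comes from. The sketch still holds all `T` rows in memory, so what shrinks is the bytes read per token, not the bytes stored.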

The cache plot

Here's the figure this post exists for.

Figure: KV-cache footprint vs context length. Curves: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

Seven designs, one model shape. Watch which regime each curve lives in: MHA blows the budget early, GQA and MLA bend the line down without changing its slope's nature, capped sparse (NSA, DSA) flattens once T crosses the cap, and linear plus the hybrid sit at the floor. The hatched horizontal line is a single-GPU VRAM budget; find where each curve crosses it.

Settings: heads h = 32, head dim d = 128, layers L = 48, GQA group size 8, MLA latent rank d_c = 512, DSA top-k 2048 (V3.2-Exp uses ~2048), NSA window 512, hybrid linear share 75% (0.75 = 3:1, as in Kimi Linear and Qwen3-Next), max context 1,000,000 tokens, VRAM budget 80 GB (drop it to 24 GB and see which curves survive).

Readouts at 1,000,000 tokens:

Variant	Cache
MHA	786.4 GB
MLA	55.30 GB
Linear	0.050 GB
DSA-effective	1.61 GB
NSA-effective	13.50 GB
Hybrid 3:1	13.86 GB

Hybrid vs MHA: 57× smaller.

What jumps out:

MHA is off the chart fast. About 786 GB at 1M.
GQA and MLA both bend the curve down by a factor, not a regime. Still linear in T.
Linear (lightning) is a flat line at the floor. Constant in T, by construction.
DSA-effective and NSA-effective grow with T at first, then flatten once the cap kicks in (top-k for DSA, branch sum for NSA).
The 3:1 hybrid is dominated by its 25% MLA share. Three quarters of the layers are flat; the remaining quarter accounts for almost all of the cache (about 13.8 of the 13.86 GB at 1M).
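Most of the readouts above fall out of a few lines of arithmetic. The sketch below is a reconstruction under assumptions I'm labelling, not the widget's source:

- 2 bytes per element.
- A 64-wide decoupled-RoPE channel for MLA.
- One d × d state per head for linear layers.
- DSA-effective counted as MHA-sized rows times min(T, top-k).
- The hybrid as a layer-share-weighted mix of linear and MLA.

NSA is left out because the text doesn't give its branch sizes beyond the window.

```python
def cache_gb(tokens, *, layers=48, heads=32, head_dim=128, gqa_group=8,
             mla_rank=512, mla_rope_dim=64, dsa_top_k=2048,
             linear_share=0.75, bytes_per_elem=2):
    """Cache bytes touched per decoded token, in GB (1e9 bytes), per variant."""
    # K and V for every head in every layer.
    mha_per_token = 2 * layers * heads * head_dim * bytes_per_elem
    # One shared latent plus the RoPE channel per layer (rope width is assumed).
    mla_per_token = layers * (mla_rank + mla_rope_dim) * bytes_per_elem
    # Fixed-size recurrent state: no dependence on `tokens` (assumed d x d per head).
    linear_state = layers * heads * head_dim * head_dim * bytes_per_elem

    out = {
        "MHA": mha_per_token * tokens,
        "GQA": mha_per_token * tokens / gqa_group,
        "MLA": mla_per_token * tokens,
        "linear": linear_state,
        # Reads stop growing once the context exceeds top-k.
        "DSA-effective": mha_per_token * min(tokens, dsa_top_k),
    }
    out["hybrid 3:1"] = (linear_share * out["linear"]
                         + (1 - linear_share) * out["MLA"])
    return {name: b / 1e9 for name, b in out.items()}

at_1m = cache_gb(1_000_000)
for name, gb in at_1m.items():
    print(f"{name:>14}: {gb:10.3f} GB")
```

With the default settings this gives 786.4 GB for MHA, 55.30 GB for MLA, 0.050 GB for linear, 1.61 GB for DSA-effective and 13.86 GB for the hybrid, matching the readouts. The only terms without a `tokens` factor are the linear state and, past the cap, DSA-effective.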

The throughput payoff

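Decoding is HBM-bandwidth-bound, so tokens/sec ≈ HBM / (KV(T) + W), where HBM is memory bandwidth in GB/s, KV(T) is the cache read per token at context length T in GB, and W is the active weights read per step in GB. Same plot, divided through.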

Figure: Decode throughput vs context. Curves: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

Tokens/sec on a memory-bound roofline, same seven variants. MHA collapses toward zero, MLA bends gently, capped-sparse and the hybrid stay nearly flat. Crank the batch slider and the weight term amortises away.

Settings: HBM bandwidth 3000 GB/s (H100 SXM ~3350; A100 ~2000), batch 1, active weights 48 GB, DSA top-k 2048, hybrid linear share 75%, MLA latent rank d_c = 512, max context 1,000,000 tokens.

Readouts at 1,000,000 tokens, 3000 GB/s, batch 1:

Variant	Throughput
MHA	3.6 tok/s
MLA	29.0 tok/s
Hybrid 3:1	48.5 tok/s
Linear	62.4 tok/s

Hybrid vs MHA: 13.5× faster.

Trust the shapes, not the absolute multipliers. The roofline ignores arithmetic and pretends the indexer is free; the real numbers in the V3.2 and Kimi Linear papers are smaller than what this toy shows. Direction is what matters: MHA bends toward zero, the cache-flat variants don't.
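For anyone who wants to poke at the roofline without the sliders, it is a one-liner. The batch handling is my assumption about what "amortises away" means (weights streamed once per step, each sequence's cache streamed once); the function name is hypothetical, and the cache sizes in the usage lines are the 1M-token readouts from the cache plot.

```python
def decode_tokens_per_sec(kv_gb, *, hbm_gb_per_s=3000.0, weights_gb=48.0, batch=1):
    """Memory-bound roofline: aggregate tokens/sec across the batch.

    Each decode step reads the active weights once and every sequence's
    cache once; arithmetic and any indexer cost are ignored.
    """
    gb_read_per_step = weights_gb + batch * kv_gb
    return batch * hbm_gb_per_s / gb_read_per_step

# Cache sizes at 1M tokens, in GB, taken from the cache-plot readouts.
for name, kv in [("MHA", 786.4), ("MLA", 55.30), ("hybrid 3:1", 13.86), ("linear", 0.050)]:
    print(f"{name:>10}: {decode_tokens_per_sec(kv):5.1f} tok/s")
```

At batch 1 this gives 3.6, 29.0, 48.5 and 62.4 tok/s. Raising `batch` helps the small-cache variants far more than MHA, because for MHA the cache term dominates the weights at every batch size.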

Recall vs throughput, on one chart

The Pareto frontier in 2026 is roughly: pure softmax (best recall, worst cache) → MLA (good recall, ~10× better cache) → DSA / NSA (small recall hit if k is right, much better cache) → hybrid 3:1 (small recall hit, very good cache) → pure linear (recall falls off).

Figure: Pareto frontier at 1M context. Markers: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

One marker per variant. Y-axis is real throughput from the cache math; X-axis is a hand-tuned recall surrogate calibrated to the rough ordering in open NIAH/RULER tables (not a measurement). Slide the recall spread to stretch or squash the gaps; the ordering holds either way.

Settings: context 1,000,000 tokens, recall spread 0.10 (hand-tuned; larger = bigger gaps between variants), DSA top-k 2048, hybrid linear share 75%, HBM bandwidth 3000 GB/s, active weights 48 GB.

Readouts at 1,000,000 tokens, 3000 GB/s:

Variant	Recall (surrogate)	Throughput
MHA	0.99	4 tok/s
GQA	0.98	21 tok/s
MLA	0.95	29 tok/s
linear (lightning)	0.57	62 tok/s
NSA-effective	0.86	49 tok/s
DSA-effective	0.92	60 tok/s
hybrid 3:1	0.93	48 tok/s

The recall numbers here are mine, hand-tuned. Treat the ordering as the claim, not the values. The ordering is what most NIAH and RULER tables agree on at this point.
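If you want to check which markers actually sit on the frontier, the dominance test is short. The points below are the readout values from the chart, so the recall column inherits the same hand-tuned caveat; `pareto_front` is a name I made up.

```python
def pareto_front(points):
    """Names of points not dominated on (recall, tokens/sec); higher is better on both."""
    front = []
    for name, (recall, tps) in points.items():
        dominated = any(
            r2 >= recall and t2 >= tps and (r2 > recall or t2 > tps)
            for other, (r2, t2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

readouts = {
    "MHA": (0.99, 4),
    "GQA": (0.98, 21),
    "MLA": (0.95, 29),
    "linear (lightning)": (0.57, 62),
    "NSA-effective": (0.86, 49),
    "DSA-effective": (0.92, 60),
    "hybrid 3:1": (0.93, 48),
}
front = pareto_front(readouts)
print(front)
```

With these values every variant except NSA-effective is on the frontier; DSA-effective beats it on both axes. That is a property of the surrogate numbers and the toy roofline, not a finding about NSA.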

Why three labs landed on the same ratio

Three labs landed on the same recipe independently, and two of them on the same ratio. Moonshot's Kimi Linear and Alibaba's Qwen3-Next-80B-A3B both sit at 3:1. AI21's Jamba sits at 7:1 with Mamba in place of linear, but the principle is the same: keep enough softmax layers to cover exact-recall queries, drop the rest. I think when independent groups starting from different priors and the same constraints converge on roughly one global-attention layer per four to eight, it's worth treating that as a fixed point of the design space, not an architectural opinion.
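What those ratios mean for a layer stack, as a small sketch. Putting the softmax layer last in each block is my assumption (the real models may order things differently), and the 48-layer depth is just the toy shape from the cache plot.

```python
def hybrid_layout(n_layers, cheap_per_softmax, cheap="linear"):
    """Repeat `cheap_per_softmax` cheap layers followed by one softmax layer."""
    period = cheap_per_softmax + 1
    return ["softmax" if (i + 1) % period == 0 else cheap
            for i in range(n_layers)]

three_to_one = hybrid_layout(48, 3)             # Kimi Linear / Qwen3-Next style ratio
seven_to_one = hybrid_layout(48, 7, "mamba")    # Jamba style ratio

print(three_to_one[:8])
print(three_to_one.count("softmax"), seven_to_one.count("softmax"))
```

At 48 layers that is 12 softmax layers for 3:1 (25%) and 6 for 7:1 (12.5%), and those are the only layers whose cache grows with context.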

A year ago the working assumption was "softmax everywhere, swap GQA in for cache". Two things broke it. Linear attention finally fixed its recall problem (per-channel gating + delta rule), and sparse attention finally became natively trainable (NSA, then deployed as DSA). Hybrid stacks showed up independently at three labs. I think softmax is still the centrepiece. It's just no longer on every layer.

If a row in the table is wrong or your favourite variant is missing, tell me in the comments. I'd rather fix the map than defend it.

Check yourself

Quiz

At 1M context, which variant has KV cache that is *not* dominated by terms scaling with T?
