Attention variants beyond softmax: a 2026 map
A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.
A year ago I'd have told you softmax attention was non-negotiable. Swap GQA in for cache, sprinkle some YaRN on top, ship. The 2025 wave of open models quietly disagreed in three different ways. This post is the map I went looking for and couldn't find: linear attention, sparse attention, MLA, and the hybrid stacks, all on the same axes.
The cache problem is what forces this. At 1M tokens, full MHA on an illustrative 48-layer / 32-head / 128-dim stack with 16-bit K and V reads about 800 GB per generated token just to look at history. That's bigger than the model. Everything in this post is a different answer to "fit the cache budget without tanking quality".
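Here is the arithmetic behind that figure as a small sketch. The stack shape comes from the paragraph above; the 2 bytes per value (fp16/bf16) and reading "1M" as 10^6 tokens are my assumptions, and the function name is just illustrative.

```python
def mha_kv_cache_bytes(tokens, layers=48, heads=32, head_dim=128, bytes_per_value=2):
    """Bytes of K and V that a full-MHA decoder holds for `tokens` of history."""
    # One K vector and one V vector per head, per layer, per token.
    per_token_per_layer = 2 * heads * head_dim * bytes_per_value
    return tokens * layers * per_token_per_layer


cache_bytes = mha_kv_cache_bytes(1_000_000)
print(f"{cache_bytes / 1e9:.0f} GB")  # 786 GB
```

Every decode step has to stream that whole history through the attention kernel, which is why the cache size doubles as the per-token read cost.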
The four moves
You can keep full softmax and throw tokens away (sparse). You can keep softmax and compress what you cache (MLA). You can drop softmax and run a constant-size recurrent state (linear). Or you can interleave layers of two kinds (hybrid). Every shipping open model in 2026 is some combination of these four.
One quick caveat. GPT-5, Claude Sonnet 4.5+, and Gemini don't publish their attention internals. The rest of this post is what's verifiable from open weights and papers. The closed frontier might be doing any of the below, or something none of us have thought of.
The attribution table
| Model | Year | Total / active | Attention | Pattern | Why this choice |
|---|---|---|---|---|---|
| DeepSeek-V2 | 2024 | 236B / 21B | MLA | every layer | introduced MLA; ~10× cache cut over MHA |
| DeepSeek-V3 | Dec 2024 | 671B / 37B | MLA + DeepSeekMoE | every layer | scaled the V2 recipe up |
| DeepSeek-V3.2-Exp | Sep 2025 | 671B / 37B | MLA + DSA | every layer | per-token top-k selection; O(L²) → O(Lk). Deployed. |
| MiniMax-01 | Jan 2025 | 456B / 45.9B | linear (lightning) + softmax | hybrid, MoE | 1M train / 4M inference context |
| MiniMax-M1 | Jun 2025 | — | linear (lightning) + softmax | hybrid | reasoning model on the same kernel mix |
| Llama-4 (Scout/Maverick) | Apr 2025 | MoE | softmax + iRoPE (3:1) | hybrid (positional) | the hybrid here is 3 RoPE-chunked layers per 1 NoPE; not a kernel swap |
| Kimi K2 | Jul 2025 | 1.04T / 32B | MLA | every layer | 384 experts, 64 attention heads |
| Kimi K2.5 / K2.6 | late 2025 / 2026 | 1T / 32B | MLA | every layer | 256K context |
| Kimi Linear | Oct 2025 | — | KDA + MLA | 3:1 hybrid | 75% linear, 25% softmax |
| Qwen3-Next-80B-A3B | Sep 2025 | 80B / 3B | Gated DeltaNet + GQA | 3:1 hybrid | same ratio as Kimi Linear, independent choice. 256K-1M context |
| Jamba-1.5-Large | Aug 2024 | 398B / 94B | Mamba + softmax + MoE | 7:1 hybrid | 7 Mamba layers per softmax layer; 256K context |
| NSA | Feb 2025 | — | Native Sparse Attention | compress + select + sliding | natively trainable. Research paper, not a shipped model. |
Six families. Three answers. Let me walk them.
MLA: compress what you keep
Cleanest one to introduce first because it's still a softmax kernel, just over a smaller cached representation. DeepSeek-V2 / V3 (2024) caches a low-rank latent (typically ~512 dimensions) plus a tiny decoupled-RoPE channel. That's roughly a 10–15× reduction over MHA, and it's already enough to make 100K-token serving comfortable on a single GPU. Every Kimi model uses MLA. So does V3.2.
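A quick check of that ratio on the illustrative 32-head / 128-dim stack. The 64-dim RoPE channel is my assumption for "tiny", and both function names are made up for this sketch.

```python
def mha_values_per_token(heads=32, head_dim=128):
    """Cached values per token per layer for full MHA (K and V for every head)."""
    return 2 * heads * head_dim


def mla_values_per_token(latent_dim=512, rope_dim=64):
    """Cached values per token per layer for MLA: one shared latent plus the RoPE channel."""
    return latent_dim + rope_dim


ratio = mha_values_per_token() / mla_values_per_token()
print(f"{ratio:.1f}x")  # 14.2x
```

The ratio depends on the head count you compare against, so treat it as a property of this toy stack rather than of any particular model.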
Linear (lightning) attention
The lineage: linear attention (2020) → RetNet (2023) → GLA (2023) → DeltaNet (2024) → Gated DeltaNet (2024) → KDA / Kimi Linear (2025). MiniMax ships its own variant under the brand name lightning attention; it's the same family.
The one-line summary: replace softmax with a kernel feature map, keep a running outer-product state, get O(T) compute and a cache that's constant in T. Two things had to land before it actually worked. A delta-rule update (treat the state as an associative memory; write the error, not the value), and per-channel forget gating (each dimension picks its own retention timescale). The deep math involves chunked-recurrent kernels and a Diagonal-Plus-Low-Rank state-transition matrix; here, the only thing you need is "cache stops growing".
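To make "write the error, not the value" concrete, here is a toy single-head version of a gated delta-rule step in plain Python. It is a sketch of the idea only, not the chunked kernel any of these models ship; the function names, the `beta` write strength, and the list-of-rows state layout are all my own choices.

```python
def gated_delta_step(state, key, value, decay, beta=1.0):
    """One token of a gated delta-rule update on a d_k x d_v state (list of rows).

    decay: per-key-channel retention in [0, 1] (the forget gate).
    beta:  write strength in [0, 1].
    """
    d_k, d_v = len(key), len(value)
    # 1. Forget: each key channel decays on its own timescale.
    state = [[decay[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2. Read what the memory currently predicts for this key.
    predicted = [sum(key[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3. Write the error (value - prediction), not the raw value.
    return [[state[i][j] + beta * key[i] * (value[j] - predicted[j])
             for j in range(d_v)] for i in range(d_k)]


def read(state, query):
    """Output for a query: query^T @ state."""
    return [sum(query[i] * row[j] for i, row in enumerate(state))
            for j in range(len(state[0]))]


state = [[0.0, 0.0], [0.0, 0.0]]
no_decay = [1.0, 1.0]
state = gated_delta_step(state, [1.0, 0.0], [5.0, 6.0], no_decay)
state = gated_delta_step(state, [0.0, 1.0], [7.0, 8.0], no_decay)
state = gated_delta_step(state, [1.0, 0.0], [9.0, 9.0], no_decay)  # overwrite key 0
print(read(state, [1.0, 0.0]))  # [9.0, 9.0]
print(read(state, [0.0, 1.0]))  # [7.0, 8.0]
```

The third write replaces the old association instead of piling on top of it, and the state is still 2 × 2 after three tokens. That is the whole "cache stops growing" property in miniature.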
Sparse attention
Three designs in this family with very different vintages but the same core move: don't read every key.
- Sliding window / chunked. The trivial baseline. Llama-4's RoPE layers cap their window at 8K; GPT-OSS-style chunked attention does the same. Cheap, but it's recall-blind.
- NSA (Feb 2025). Hierarchical, three branches: compressed coarse blocks (a summary state), selected fine blocks (the most relevant ones), and a sliding window. Trained natively (not a finetune onto a softmax-pretrained model). Research paper, not yet in a shipped frontier model.
- DSA, in DeepSeek-V3.2-Exp (Sep 2025). A small FP8 "lightning indexer" scores every historical key per query, then attention runs only over the top-k. Core attention drops from O(L²) to O(L·k). Shipped.
The pattern: NSA is the design study, DSA is the deployment. Both flatten the cache curve once T crosses the cap.
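The select-then-attend move can be sketched in a few lines. This is a toy, assuming the index scores are already computed; in DSA they come from the learned indexer, which I don't attempt to reproduce here. The function name and argument layout are hypothetical.

```python
import math


def topk_sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention restricted to the k keys with the highest index score."""
    # Keep only the k best-scoring positions; everything else is never read.
    chosen = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in chosen]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][j] for w, i in zip(weights, chosen)) / total
              for j in range(dim)]
    return output, sorted(chosen)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.8, 0.2]  # stand-in for indexer output
output, chosen = topk_sparse_attention([0.0, 0.0], keys, values, scores, k=2)
print(chosen)  # [1, 2]
print(output)  # [2.5]
```

The softmax itself is untouched; the saving comes from the attention loop touching k positions per query instead of all of them.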
The cache plot
This is the figure I went looking for and couldn't find on the same axes anywhere.
[Figure: KV-cache footprint vs context length. Interactive controls: DSA top-k (V3.2-Exp uses ~2048) and linear-layer fraction (0.75 = 3:1, as in Kimi Linear and Qwen3-Next).]
What jumps out:
- MHA is off the chart fast. Seven hundred-something GB at 1M.
- GQA and MLA both bend the curve down by a factor, not a regime. Still linear in T.
- Linear (lightning) is a flat line at the floor. Constant in T, by construction.
- DSA-effective and NSA-effective grow with T at first, then flatten once the cap kicks in (top-k for DSA, branch sum for NSA).
- The 3:1 hybrid is dominated by its 25% MLA share. Three quarters of the layers are flat; the remaining quarter carries essentially all of the T-dependent cache.
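If you want to regenerate the curves yourself, a toy cost model like the one below gets you the shapes. Everything numeric beyond the 48 / 32 / 128 stack, the top-k of 2048, and the 0.75 linear fraction is my assumption: 2 bytes per value, 8 KV heads for GQA, a 512 + 64 MLA latent, and a linear-attention state of one head_dim × head_dim matrix per head. NSA is left out because its branch sizes aren't given here.

```python
LAYERS, HEADS, HEAD_DIM, BYTES = 48, 32, 128, 2  # illustrative stack, 2 bytes/value assumed


def mha(tokens):
    return tokens * LAYERS * 2 * HEADS * HEAD_DIM * BYTES


def gqa(tokens, kv_heads=8):
    return tokens * LAYERS * 2 * kv_heads * HEAD_DIM * BYTES


def mla(tokens, latent_dim=512, rope_dim=64):
    return tokens * LAYERS * (latent_dim + rope_dim) * BYTES


def linear(tokens):
    # Constant-size recurrent state; `tokens` is deliberately unused.
    return LAYERS * HEADS * HEAD_DIM * HEAD_DIM * BYTES


def dsa_effective(tokens, top_k=2048):
    # Bytes actually read per query: the MLA cache, capped at top_k positions.
    return mla(min(tokens, top_k))


def hybrid(tokens, linear_fraction=0.75):
    return linear_fraction * linear(tokens) + (1 - linear_fraction) * mla(tokens)


variants = [mha, gqa, mla, linear, dsa_effective, hybrid]
for tokens in (8_000, 128_000, 1_000_000):
    row = "  ".join(f"{fn.__name__}={fn(tokens) / 1e9:.2f}GB" for fn in variants)
    print(f"T={tokens:>9,}  {row}")
```

Under these assumptions the MLA layers account for more than 99% of the hybrid's bytes at 1M tokens, which is the last bullet in numbers.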
The throughput payoff
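Decoding is HBM-bandwidth-bound, so tokens/sec ≈ HBM / (KV(T) + W), where HBM is memory bandwidth in bytes/sec, KV(T) is the cache bytes read per generated token at context length T, and W is the weight bytes read per token. Same plot, divided through.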
[Figure: Decode throughput vs context. Interactive control: HBM bandwidth in GB/s (H100 SXM ~3350; A100 ~2000).]
Trust the shapes, not the absolute multipliers. The roofline ignores arithmetic and pretends the indexer is free; the real numbers in the V3.2 and Kimi Linear papers are smaller than what this toy shows. Direction is what matters: MHA bends toward zero, the cache-flat variants don't.
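The same roofline as a sketch, with the same caveats. The 40 GB of weights read per token is a made-up round number, the flat state size assumes one 128 × 128 matrix per head, and the cache sizes assume 2 bytes per value. Only the H100 bandwidth figure comes from the plot.

```python
MHA_BYTES_PER_TOKEN = 48 * 2 * 32 * 128 * 2  # layers * (K, V) * heads * head_dim * bytes
FLAT_STATE_BYTES = 48 * 32 * 128 * 128 * 2   # assumed constant-size linear-attention state
WEIGHT_BYTES = 40e9                          # hypothetical weights read per token
H100_BYTES_PER_SEC = 3350e9                  # ~3350 GB/s


def decode_tokens_per_sec(kv_bytes, weight_bytes=WEIGHT_BYTES, bandwidth=H100_BYTES_PER_SEC):
    """Bandwidth-bound roofline: every generated token re-reads the cache and the weights."""
    return bandwidth / (kv_bytes + weight_bytes)


for tokens in (8_000, 128_000, 1_000_000):
    mha_rate = decode_tokens_per_sec(tokens * MHA_BYTES_PER_TOKEN)
    flat_rate = decode_tokens_per_sec(FLAT_STATE_BYTES)
    print(f"T={tokens:>9,}  MHA {mha_rate:6.1f} tok/s  flat {flat_rate:6.1f} tok/s")
```

The flat column doesn't move with T, and the MHA column collapses once the cache outweighs the weights.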
Recall vs throughput, on one chart
The Pareto frontier in 2026 is roughly: pure softmax (best recall, worst cache) → MLA (good recall, ~10× better cache) → DSA / NSA (small recall hit if k is right, much better cache) → hybrid 3:1 (small recall hit, very good cache) → pure linear (recall falls off the cliff).
[Figure: Pareto frontier at 1M context, recall vs throughput. One hand-tuned control; larger values widen the gaps between variants.]
The recall numbers here are mine, hand-tuned. Treat the ordering as the claim, not the values. The ordering is what most NIAH and RULER tables agree on at this point.
Why three labs landed on the same ratio
Three labs landed on the same design independently. Moonshot's Kimi Linear and Alibaba's Qwen3-Next-80B-A3B both sit at exactly 3:1. AI21's Jamba sits at 7:1 with Mamba in place of linear attention, but the principle is the same: keep enough softmax layers to cover exact-recall queries, drop the rest. I think when independent groups starting from different priors and the same constraints converge on roughly one global-attention layer per four to eight, it's worth treating that as a fixed point of the design space, not an architectural opinion.
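For concreteness, this is what those ratios look like as a layer list. Putting the softmax layer last in each block is my assumption; the models' actual placement may differ. `hybrid_layout` is a made-up helper.

```python
def hybrid_layout(n_layers, cheap_per_softmax, cheap_name="linear"):
    """Layer kinds for a hybrid stack: `cheap_per_softmax` cheap layers, then one softmax layer."""
    period = cheap_per_softmax + 1
    return ["softmax" if (i + 1) % period == 0 else cheap_name
            for i in range(n_layers)]


print(hybrid_layout(8, 3))
# ['linear', 'linear', 'linear', 'softmax', 'linear', 'linear', 'linear', 'softmax']
print(hybrid_layout(8, 7, cheap_name="mamba").count("softmax"))  # 1
```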
A year ago the working assumption was "softmax everywhere, swap GQA in for cache". Two things broke it. Linear attention finally fixed its recall problem (per-channel gating + delta rule), and sparse attention finally became natively trainable (NSA, then deployed as DSA). Hybrid stacks are what the field landed on independently in three places. I think softmax is still the centrepiece; it's just no longer on every layer.
Check yourself
At 1M context, which variant has KV cache that is *not* dominated by terms scaling with T?