Attention variants beyond softmax: a 2026 map

A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.

Vilhelm Toivonen

A year ago I'd have told you softmax attention was non-negotiable. Swap GQA in for cache, sprinkle some YaRN on top, ship. The 2025 wave of open models quietly disagreed in three different ways. This post is the map I went looking for and couldn't find: linear attention, sparse attention, MLA, and the hybrid stacks, all on the same axes.

The cache problem is what forces this. At 1M tokens, full MHA on an illustrative 48-layer / 32-head / 128-dim stack reads about 800 GB per generated token just to look at history. That's bigger than the model. Everything in this post is a different answer to "fit the cache budget without tanking quality".
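The arithmetic behind that figure, as a quick sketch. The stack dimensions are the illustrative ones above; the 2-byte (fp16/bf16) cache entry is an assumption, chosen because it lands on the ~800 GB quoted.

```python
def mha_cache_bytes(tokens, layers=48, heads=32, head_dim=128, bytes_per_elem=2):
    """Bytes of K/V history that one decode step has to read under full MHA."""
    # 2 = one K vector and one V vector per head, per layer, per cached token.
    per_token = 2 * layers * heads * head_dim * bytes_per_elem
    return per_token * tokens


print(mha_cache_bytes(1) / 1e3, "KB per cached token")
print(mha_cache_bytes(1_000_000) / 1e9, "GB at 1M tokens")
```

That works out to roughly 786 KB per cached token, so a million tokens of history is roughly 786 GB.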

The four moves

You can keep full softmax and throw tokens away (sparse). You can keep softmax and compress what you cache (MLA). You can drop softmax and run a constant-size recurrent state (linear). Or you can interleave layers of two kinds (hybrid). Every shipping open model in 2026 is some combination of these four.

One quick caveat. GPT-5, Claude Sonnet 4.5+, and Gemini don't publish their attention internals. The rest of this post is what's verifiable from open weights and papers. The closed frontier might be doing any of the below, or something none of us have thought of.

The attribution table

| Model | Year | Total / active | Attention | Pattern | Why this choice |
|---|---|---|---|---|---|
| DeepSeek-V2 | 2024 | 236B / 21B | MLA | every layer | introduced MLA; ~10× cache cut over MHA |
| DeepSeek-V3 | Dec 2024 | 671B / 37B | MLA + DeepSeekMoE | every layer | scaled the V2 recipe up |
| DeepSeek-V3.2-Exp | Sep 2025 | 671B / 37B | MLA + DSA | every layer | per-token top-k selection; O(L²) → O(Lk). Deployed. |
| MiniMax-01 | Jan 2025 | 456B / 45.9B | linear (lightning) + softmax | hybrid, MoE | 1M train / 4M inference context |
| MiniMax-M1 | Jun 2025 | | linear (lightning) + softmax | hybrid | reasoning model on the same kernel mix |
| Llama-4 (Scout/Maverick) | Apr 2025 | MoE | softmax + iRoPE (3:1) | hybrid (positional) | the hybrid here is 3 RoPE-chunked layers per 1 NoPE; not a kernel swap |
| Kimi K2 | Jul 2025 | 1.04T / 32B | MLA | every layer | 384 experts, 64 attention heads |
| Kimi K2.5 / K2.6 | late 2025 / 2026 | 1T / 32B | MLA | every layer | 256K context |
| Kimi Linear | Oct 2025 | | KDA + MLA | 3:1 hybrid | 75% linear, 25% softmax |
| Qwen3-Next-80B-A3B | Sep 2025 | 80B / 3B | Gated DeltaNet + GQA | 3:1 hybrid | same ratio as Kimi Linear, independent choice. 256K-1M context |
| Jamba-1.5-Large | Aug 2024 | 398B / 94B | Mamba + softmax + MoE | 7:1 hybrid | 7 Mamba layers per softmax layer; 256K context |
| NSA | Feb 2025 | | Native Sparse Attention | compress + select + sliding | natively trainable. Research paper, not a shipped model. |

Six families. Three answers. Let me walk them.

MLA: compress what you keep

Cleanest one to introduce first because it's still a softmax kernel, just over a smaller cached representation. DeepSeek-V2 / V3 (2024) caches a low-rank latent of dimension d_c (typically ~512) plus a tiny decoupled-RoPE channel. That's roughly a 10–15× reduction over MHA, and it's already enough to make 100K-token serving comfortable on a single GPU. Every Kimi model uses MLA. So does V3.2.
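To put numbers on "10–15×" and "comfortable at 100K", here is a small sketch on the same illustrative 48-layer / 32-head / 128-dim stack. The 64-dim RoPE channel and the 2-byte cache entry are assumptions, not figures from this post.

```python
LAYERS, HEADS, HEAD_DIM = 48, 32, 128   # the illustrative stack
D_C = 512                               # latent width quoted above
D_ROPE = 64                             # assumption: width of the decoupled-RoPE channel
BYTES = 2                               # assumption: fp16/bf16 cache entries


def cache_gb(elems_per_token_per_layer, tokens):
    """Total cache size in GB for a given per-token, per-layer entry width."""
    return elems_per_token_per_layer * LAYERS * BYTES * tokens / 1e9


mha_elems = 2 * HEADS * HEAD_DIM   # full K and V for every head
mla_elems = D_C + D_ROPE           # one shared latent plus the RoPE channel

print("reduction:", round(mha_elems / mla_elems, 1), "x")
print("MHA @ 100K:", round(cache_gb(mha_elems, 100_000), 2), "GB")
print("MLA @ 100K:", round(cache_gb(mla_elems, 100_000), 2), "GB")
```

Under these assumptions the reduction is about 14×, and a 100K-token cache drops from roughly 79 GB to about 5.5 GB.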

Linear (lightning) attention

The lineage runs: linear attention (2020) → RetNet (2023) → GLA (2023) → DeltaNet (2024) → Gated DeltaNet (2024) → KDA / Kimi Linear (2025). MiniMax ships its own variant under the brand name lightning attention; it's the same family.

The one-line summary: replace softmax with a kernel feature map, keep a running outer-product state, get O(T) compute and a cache that's constant in T. Two things had to land before it actually worked. A delta-rule update (treat the state as an associative memory; write the error, not the value), and per-channel forget gating (each dimension picks its own retention timescale). The deep math involves chunked-recurrent kernels and a Diagonal-Plus-Low-Rank state-transition matrix; here, the only thing you need is "cache stops growing".
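A toy, single-head version of that recurrence, to make "write the error, not the value" concrete. This is a minimal sketch, not any lab's kernel: the function names are hypothetical, the gate-then-write ordering is my assumption about the Gated DeltaNet form, and real implementations use chunked parallel kernels rather than a per-token Python loop.

```python
def read(S, q):
    """Query the associative memory: returns S^T q (a d_v-vector)."""
    d_k, d_v = len(S), len(S[0])
    return [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]


def gated_delta_step(S, k, v, beta=1.0, alpha=None):
    """One recurrent step on a d_k x d_v state S (updated in place)."""
    d_k, d_v = len(S), len(S[0])
    if alpha is None:
        alpha = [1.0] * d_k              # no forgetting
    # Per-channel forget gate: each key channel decays at its own rate.
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] *= alpha[i]
    # What the memory currently returns for this key.
    v_hat = read(S, k)
    # Delta rule: write the prediction error (v - v_hat), not the raw value.
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - v_hat[j])
    return S


S = [[0.0, 0.0], [0.0, 0.0]]                  # the whole "cache": d_k * d_v floats
gated_delta_step(S, [1.0, 0.0], [1.0, 2.0])
print(read(S, [1.0, 0.0]))                    # [1.0, 2.0]
gated_delta_step(S, [1.0, 0.0], [5.0, 6.0])   # same key, new value
print(read(S, [1.0, 0.0]))                    # [5.0, 6.0]: overwritten, not summed
```

A plain additive update would have returned [6.0, 8.0] on the second read. The state is the same 2×2 matrix however many tokens pass through it, which is the "cache stops growing" part.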

Sparse attention

Three designs in this family, of very different vintages but with the same core move: don't read every key.

  • Sliding window / chunked. The trivial baseline. Llama-4's RoPE layers cap their window at 8K; GPT-OSS-style chunked attention does the same. Cheap, but it's recall-blind.
  • NSA (Feb 2025). Hierarchical, three branches: compressed coarse blocks (a summary state), selected fine blocks (the most relevant ones), and a sliding window. Trained natively (not a finetune onto a softmax-pretrained model). Research paper, not yet in a shipped frontier model.
  • DSA, in DeepSeek-V3.2-Exp (Sep 2025). A small FP8 "lightning indexer" scores every historical key per query, then attention runs only over the top-k. Core attention drops from O(L²) to O(L·k). Shipped.

The pattern: NSA is the design study, DSA is the deployment. Both flatten the cache curve once T crosses the cap.
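The DSA-style select-then-attend step can be sketched in a few lines. This is an illustration of the idea only: `topk_attention` and its arguments are hypothetical names, the indexer here is a plain dot product over separate low-cost vectors, and nothing below reflects the actual V3.2-Exp indexer architecture or its FP8 arithmetic.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def topk_attention(q, keys, values, idx_q, idx_keys, k):
    """Attend over only the k positions the indexer scores highest."""
    # 1) Cheap indexer pass: one score per historical key.
    idx_scores = [dot(idx_q, ik) for ik in idx_keys]
    # 2) Keep the k best positions (all of them if k exceeds the history).
    keep = sorted(range(len(keys)), key=lambda j: idx_scores[j], reverse=True)[:k]
    keep.sort()  # restore positional order
    # 3) Ordinary scaled softmax attention, restricted to the kept positions.
    scale = 1.0 / math.sqrt(len(q))
    w = softmax([dot(q, keys[j]) * scale for j in keep])
    d_v = len(values[0])
    out = [sum(w[i] * values[j][c] for i, j in enumerate(keep)) for c in range(d_v)]
    return out, keep


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
q = [1.0, 1.0]
# For the demo the indexer simply reuses q and keys.
print(topk_attention(q, keys, values, q, keys, k=1))   # ([2.0, 2.0], [2])
```

The indexer still touches every key, so step 1 stays O(L) per query; the saving is that the expensive attention in step 3 only ever sees k entries.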

The cache plot

This is the figure I went looking for and couldn't find on the same axes anywhere.

Figure: KV-cache footprint vs context length

Series: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

Seven attention designs on the same 48-layer / 32-head / 128-dim stack. MHA blows the budget, GQA and MLA buy ~10× each, capped sparse (NSA, DSA) flatten once T crosses the cap, linear and the hybrid sit at the floor. Hatched horizontal line is a single-GPU VRAM budget.

Settings shown: DSA top-k 2048 (V3.2-Exp uses ~2048), linear share 75% (0.75 = 3:1, as in Kimi Linear and Qwen3-Next), max context 1,000,000 tokens.

Readouts at max context:

  • MHA @ 1,000,000 tok: 786.4 GB
  • MLA: 55.30 GB
  • Linear: 0.050 GB
  • DSA-effective: 1.61 GB
  • NSA-effective: 13.50 GB
  • Hybrid 3:1: 13.86 GB
  • Hybrid vs MHA: 57× smaller

What jumps out:

  • MHA is off the chart fast. Seven hundred-something GB at 1M.
  • GQA and MLA both bend the curve down by a factor, not a regime. Still linear in T.
  • Linear (lightning) is a flat line at the floor. Constant in T, by construction.
  • DSA-effective and NSA-effective grow with T at first, then flatten once the cap kicks in (top-k for DSA, branch sum for NSA).
  • The 3:1 hybrid is dominated by its 25% MLA share. Three quarters of the layers are flat; one quarter does all the work.
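Most of the readouts fall out of a few lines of arithmetic. A minimal sketch follows; the 2-byte cache entry, the 64-dim RoPE channel for MLA, the GQA group size of 8, and treating "DSA-effective" as top-k full-size K/V entries are all assumptions, chosen because they reproduce the numbers above. NSA is left out because its branch sizes can't be recovered from the readouts.

```python
BYTES = 2  # assumption: fp16/bf16 cache entries


def cache_gb(variant, T, layers=48, heads=32, head_dim=128, gqa_group=8,
             d_latent=512, d_rope=64, top_k=2048, linear_share=0.75):
    """Cache read per decoded token, in GB, for one attention variant."""
    mha_tok = 2 * heads * head_dim          # K and V elements per token per layer
    per_layer = {
        "mha": mha_tok * T,
        "gqa": mha_tok // gqa_group * T,    # assumption: 8 query heads share one K/V head
        "mla": (d_latent + d_rope) * T,
        "linear": heads * head_dim * head_dim,   # one d x d state per head, no T
        "dsa": mha_tok * min(T, top_k),     # grows with T, then flattens at the cap
    }
    if variant == "hybrid":
        # Mix of linear layers and MLA layers, weighted by layer share.
        elems = (linear_share * per_layer["linear"]
                 + (1 - linear_share) * per_layer["mla"])
    else:
        elems = per_layer[variant]
    return elems * layers * BYTES / 1e9


for name in ["mha", "gqa", "mla", "linear", "dsa", "hybrid"]:
    print(f"{name:7s} {cache_gb(name, 1_000_000):10.3f} GB")
```

In the hybrid line, the MLA quarter contributes about 13.82 GB and the linear three quarters about 0.04 GB, which is the "one quarter does all the work" point in numbers.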

The throughput payoff

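Decoding is HBM-bandwidth-bound: every generated token has to stream the weights and the relevant cache out of memory. So tokens/sec ≈ HBM / (KV(T) + W), where HBM is memory bandwidth in GB/s, KV(T) is the cache read per token in GB, and W is the weight read per step in GB. Same plot, divided through.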

Figure: Decode throughput vs context

Series: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

Tokens/sec on a memory-bound roofline. Same seven variants. MHA collapses, MLA bends gently, capped-sparse and hybrid stay nearly flat. Raising the batch size amortises the weight term.

Settings shown: HBM bandwidth 3000 GB/s (H100 SXM ~3350; A100 ~2000), batch 1, DSA top-k 2048, linear share 75%, max context 1,000,000 tokens.

Readouts at max context:

  • MHA @ 1,000,000: 3.6 tok/s
  • MLA: 29.0 tok/s
  • Hybrid 3:1: 48.5 tok/s
  • Linear: 62.4 tok/s
  • Hybrid vs MHA: 13.5× faster

Trust the shapes, not the absolute multipliers. The roofline ignores arithmetic and pretends the indexer is free; the real numbers in the V3.2 and Kimi Linear papers are smaller than what this toy shows. Direction is what matters: MHA bends toward zero, the cache-flat variants don't.
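For anyone who wants to poke at the toy, the roofline is one line of code. A sketch under stated assumptions: the 48 GB weight read is back-solved from the readouts rather than stated anywhere in the plot, and the batch handling (weights read once per step, each sequence reading its own cache) is my guess at how the weight term amortises.

```python
def decode_tok_per_s(kv_gb, hbm_gb_s=3000.0, weights_gb=48.0, batch=1):
    """Memory-bound roofline: aggregate tokens/sec across the batch."""
    # Per step: weights are streamed once, each sequence streams its own cache.
    bytes_per_step_gb = batch * kv_gb + weights_gb
    return batch * hbm_gb_s / bytes_per_step_gb


# Cache sizes in GB at 1M tokens, taken from the cache-plot readouts.
kv_at_1m = {"MHA": 786.4, "MLA": 55.30, "Hybrid 3:1": 13.86, "Linear": 0.050}

for name, kv in kv_at_1m.items():
    print(f"{name:11s} {decode_tok_per_s(kv):5.1f} tok/s")
```

At batch 1 this reproduces the readouts. Raising `batch` helps the cache-flat variants far more than MHA, because for MHA the cache term already dwarfs the weights.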

Recall vs throughput, on one chart

The Pareto frontier in 2026 is roughly: pure softmax (best recall, worst cache) → MLA (good recall, ~10× better cache) → DSA / NSA (small recall hit if k is right, much better cache) → hybrid 3:1 (small recall hit, very good cache) → pure linear (recall falls off the cliff).

Figure: Pareto frontier at 1M context

Series: MHA, GQA, MLA, linear (lightning), NSA-effective, DSA-effective, hybrid 3:1.

One marker per variant. Y-axis is real throughput from the cache math; X-axis is a hand-tuned recall surrogate calibrated to the rough ordering in open NIAH/RULER tables (it is not a measurement). The recallSpread setting compresses or expands the gap.

Settings shown: context 1,000,000 tokens, recallSpread 0.10 (hand-tuned; larger = bigger gaps between variants), DSA top-k 2048, linear share 75%, HBM bandwidth 3000 GB/s.

Readouts:

  • MHA: 0.99 recall · 4 tok/s
  • GQA: 0.98 recall · 21 tok/s
  • MLA: 0.95 recall · 29 tok/s
  • linear (lightning): 0.57 recall · 62 tok/s
  • NSA-effective: 0.86 recall · 49 tok/s
  • DSA-effective: 0.92 recall · 60 tok/s
  • hybrid 3:1: 0.93 recall · 48 tok/s

The recall numbers here are mine, hand-tuned. Treat the ordering as the claim, not the values. The ordering is what most NIAH and RULER tables agree on at this point.
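If you want to check which markers actually sit on the frontier, a dominance filter over the readouts does it. A small sketch; `pareto_front` is a hypothetical helper, and the inputs are the hand-tuned toy values above, so the output inherits all of their caveats.

```python
def pareto_front(points):
    """points: {name: (recall, tok_per_s)}; higher is better on both axes."""
    front = []
    for name, (r, t) in points.items():
        # A point is dominated if another is at least as good on both axes
        # and strictly better on one.
        dominated = any(
            r2 >= r and t2 >= t and (r2 > r or t2 > t)
            for other, (r2, t2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order from best recall to best throughput.
    return sorted(front, key=lambda n: -points[n][0])


readouts = {
    "MHA": (0.99, 4),
    "GQA": (0.98, 21),
    "MLA": (0.95, 29),
    "linear (lightning)": (0.57, 62),
    "NSA-effective": (0.86, 49),
    "DSA-effective": (0.92, 60),
    "hybrid 3:1": (0.93, 48),
}
print(pareto_front(readouts))
```

On these numbers every variant except NSA-effective is non-dominated; NSA-effective loses to DSA-effective on both axes. Since the recall values are hand-tuned, that says more about the surrogate than about NSA.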

Why three labs landed on the same ratio

Two labs landed on 3:1 independently: Moonshot with Kimi Linear and Alibaba with Qwen3-Next-80B-A3B. A third, AI21's Jamba, sits at 7:1 with Mamba in place of linear attention, but the principle is the same: keep enough softmax layers to cover exact-recall queries, drop the rest. I think when independent groups starting from different priors and the same constraints converge on roughly one global-attention layer in every four to eight, it's worth treating that as a fixed point of the design space, not an architectural opinion.
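In config terms, the ratio is just a repeating layer schedule. A minimal sketch; `hybrid_schedule` is a hypothetical helper, and placing the softmax layer last in each group is an assumption for illustration, not a claim about where any of these models put it.

```python
def hybrid_schedule(n_layers, cheap_per_softmax, cheap="linear"):
    """Per-layer mixer kinds for an N:1 hybrid, e.g. 3:1 -> cheap x3, softmax, repeat."""
    period = cheap_per_softmax + 1
    return ["softmax" if (i + 1) % period == 0 else cheap for i in range(n_layers)]


three_to_one = hybrid_schedule(48, 3)
seven_to_one = hybrid_schedule(48, 7, cheap="mamba")

print(three_to_one[:4], three_to_one.count("softmax"))
print(seven_to_one[:8], seven_to_one.count("softmax"))
```

On a 48-layer stack, 3:1 leaves 12 softmax layers and 7:1 leaves 6. Those few layers are the only ones whose cache still grows with context.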

A year ago the working assumption was "softmax everywhere, swap GQA in for cache". Two things broke it. Linear attention finally fixed its recall problem (per-channel gating + delta rule), and sparse attention finally became natively trainable (NSA, then deployed as DSA). Hybrid stacks are what the field landed on independently in three places. I think softmax is still the centrepiece — it's just no longer on every layer.

Check yourself

At 1M context, which variant has KV cache that is *not* dominated by terms scaling with T?
