These notes introduce linear attention and its variants. We cover:
- How linear attention compresses the key-value history into a fixed-size state, reducing complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ (a minimal sketch of this recurrence follows the list);
- The Delta Rule and gated variants for improved memory fidelity;
- Chunkwise algorithms that enable parallel training via the WY representation and the UT transformation.

Complete derivations and algorithms are provided.
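
To make the fixed-size-state idea concrete, here is a minimal sketch of the causal linear-attention recurrence (the function name is illustrative and not from these notes; feature maps and normalization are omitted as a simplifying assumption):

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention via a recurrent fixed-size state.

    q, k: (L, d_k); v: (L, d_v). The state S has shape (d_k, d_v),
    independent of the sequence length L, so the scan runs in O(L)
    time with O(d_k * d_v) memory for the state.
    Feature maps and normalization are omitted for clarity.
    """
    L, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))           # compressed key-value history
    out = np.empty((L, d_v))
    for t in range(L):
        S += np.outer(k[t], v[t])      # S_t = S_{t-1} + k_t v_t^T
        out[t] = q[t] @ S              # o_t = S_t^T q_t
    return out
```

Because $S_t = \sum_{s \le t} k_s v_s^{\top}$ has shape $d_k \times d_v$ regardless of $L$, each step costs $\mathcal{O}(d_k d_v)$ and the whole sequence costs $\mathcal{O}(L)$, in contrast to the $\mathcal{O}(L^2)$ cost of materializing the full attention matrix.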