zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. thomas+Yc[view] [source] 2026-02-04 15:33:26
>>fheins+(OP)
There's a graveyard of 100s of papers with "approximate near linear time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214

◧◩
2. cubefo+hf[view] [source] 2026-02-04 15:43:39
>>thomas+Yc
I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.
◧◩◪
3. andy12+Zl[view] [source] 2026-02-04 16:13:14
>>cubefo+hf
It really isn't sub N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that still has complexity O(N^2). So overall it still has the same complexity; just with a smaller constant factor [1]

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556

[go to top]