zlacker

[parent] [thread] 2 comments
1. andy12+(OP)[view] [source] 2026-02-04 16:13:14
It really isn't sub-N^2. The main attention is O(Nk), but only thanks to a lightning indexer that is itself O(N^2). So the overall complexity is unchanged, just with a smaller constant factor. [1]

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556
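
To make the complexity argument concrete, here's a minimal NumPy sketch of the two-stage idea (my own toy code, not DeepSeek's implementation; the function names, shapes, and the omitted causal masking are all illustrative). The indexer still scores every query against every key, O(L^2), but in a tiny dimension, while the main attention only touches the k selected keys per query, O(Lk).

    import numpy as np

    def lightning_indexer(q_small, k_small, k=64):
        # q_small, k_small: (L, d_small) cheap low-dimensional projections.
        # Scoring every pair is still O(L^2), just with a small constant.
        scores = q_small @ k_small.T                # (L, L)
        return np.argsort(-scores, axis=1)[:, :k]   # top-k key indices per query

    def sparse_attention(q, keys, values, selected):
        # Main attention only looks at the k selected keys/values per query:
        # O(L * k * d) instead of O(L^2 * d).
        L, d = q.shape
        out = np.empty_like(q)
        for i in range(L):
            ks, vs = keys[selected[i]], values[selected[i]]  # (k, d) each
            s = q[i] @ ks.T / np.sqrt(d)
            w = np.exp(s - s.max())                          # softmax over the k scores
            out[i] = (w / w.sum()) @ vs
        return out

    L, d, d_small = 1024, 128, 16
    rng = np.random.default_rng(0)
    q, keys, values = (rng.standard_normal((L, d)) for _ in range(3))
    q_small, k_small = (rng.standard_normal((L, d_small)) for _ in range(2))
    out = sparse_attention(q, keys, values, lightning_indexer(q_small, k_small))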

replies(1): >>cubefo+Pp
2. cubefo+Pp[view] [source] 2026-02-04 18:03:16
>>andy12+(OP)
Okay, then let's see whether truly linear architectures, like Gated DeltaNet or Mamba-3, show up in some larger models. I don't believe there is a "lower bound" that says they can never match (or exceed) the real-world performance of quadratic attention. (Perfect recall on unrealistic needle-in-a-haystack tests doesn't count.)
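
For the flavor of why those are O(L), here's a rough sketch in the spirit of the delta rule that DeltaNet builds on (my own illustration, not either paper's model; gating, as in Gated DeltaNet or Mamba, is omitted). The whole past is compressed into a fixed-size state that's updated once per token, so there's no L x L comparison and hence no guarantee of perfect recall.

    import numpy as np

    def delta_rule_scan(q, k, v, beta=0.5):
        # q, k, v: (L, d). One fixed-size (d, d) memory is updated per token,
        # so the cost is O(L * d^2) regardless of sequence length L.
        L, d = q.shape
        S = np.zeros((d, d))                           # memory: size independent of L
        out = np.empty_like(v)
        for t in range(L):
            pred = S @ k[t]                            # what the memory recalls for k_t
            S = S + beta * np.outer(v[t] - pred, k[t]) # write only the correction
            out[t] = S @ q[t]                          # read with the query
        return out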
replies(1): >>andy12+Ug1
3. andy12+Ug1[view] [source] [discussion] 2026-02-04 22:07:47
>>cubefo+Pp
I'm also sure that some kind of linear architecture is possible. After all, humans don't have N^2 perfect recall either.