zlacker

[parent] [thread] 3 comments
1. cubefo+(OP)[view] [source] 2026-02-04 15:43:39
I think DeepSeek V3.2 is sub-N^2, and it clearly performs quite well, which refutes the alleged lower bounds in the paper.
replies(1): >>andy12+I6
2. andy12+I6[view] [source] 2026-02-04 16:13:14
>>cubefo+(OP)
It really isn't sub-N^2. The main attention is only O(Nk), but only because a lightning indexer does the token selection, and that indexer itself still has O(N^2) complexity. So overall it has the same asymptotic complexity, just with a smaller constant factor [1] (toy sketch of the two stages below).

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556
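
Toy illustration (my own code, not from the paper; the names dsa_like_attention, Qi/Ki, and d_idx are made up): stage 1 is the cheap-but-quadratic indexer, stage 2 is the O(Nk) sparse attention.

    # Toy sketch of the two-stage structure (not DeepSeek's code; causal masking omitted).
    # Stage 1 scores every query against every key -> still O(L^2), just cheap per pair.
    # Stage 2 restricts the expensive attention to the top-k keys -> O(L*k).
    import numpy as np

    def dsa_like_attention(Q, K, V, Qi, Ki, k):
        """Q, K, V: (L, d) main-attention inputs; Qi, Ki: (L, d_idx) tiny
        indexer projections with d_idx << d; k: keys kept per query."""
        L, d = Q.shape
        # Stage 1: lightning indexer, O(L^2 * d_idx)
        index_scores = Qi @ Ki.T                                  # (L, L)
        topk = np.argpartition(-index_scores, k, axis=1)[:, :k]   # (L, k)
        # Stage 2: sparse main attention over selected keys, O(L * k * d)
        out = np.empty_like(Q)
        for t in range(L):
            sel = topk[t]
            scores = Q[t] @ K[sel].T / np.sqrt(d)                 # (k,)
            w = np.exp(scores - scores.max())
            out[t] = (w / w.sum()) @ V[sel]
        return out

Shrinking d_idx only shrinks the constant in front of L^2; the exponent stays.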

replies(1): >>cubefo+xw
3. cubefo+xw[view] [source] [discussion] 2026-02-04 18:03:16
>>andy12+I6
Okay, then let's see whether real linear architectures, like Gated DeltaNet or Mamba-3, show up in some larger models. I don't believe there is a "lower bound" which states that those can never reach (or exceed) the real-world performance of quadratic attention. (Perfect recall in unrealistic needle-in-a-haystack tests doesn't count.)
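
Rough idea of what I mean by "real linear" (a toy gated delta-rule recurrence; my own simplification, not the actual Gated DeltaNet or Mamba-3 kernels; gated_delta_rule, alpha, beta are illustrative names): the state is a fixed-size matrix, so compute is O(L) and memory is O(1) in context length.

    # Toy linear-time recurrence in the spirit of a gated delta rule (illustrative only).
    import numpy as np

    def gated_delta_rule(Q, K, V, alpha, beta):
        """Q, K, V: (L, d); alpha, beta: (L,) per-token gates in (0, 1)."""
        L, d = Q.shape
        S = np.zeros((d, d))        # fixed-size associative memory
        out = np.empty_like(V)
        for t in range(L):
            q, k, v = Q[t], K[t], V[t]
            # decay old memory, then write the prediction error along key k
            S = alpha[t] * S + beta[t] * np.outer(v - S @ k, k)
            out[t] = S @ q
        return out

Whether that fixed-size, lossy state can match full attention on real workloads is exactly the open question; nothing about it looks like a fundamental lower bound to me.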
replies(1): >>andy12+Cn1
4. andy12+Cn1[view] [source] [discussion] 2026-02-04 22:07:47
>>cubefo+xw
I'm also sure that some kind of linear architecture is possible. After all, humans don't have N^2 perfect recall either.