
1. antire+ (OP) 2026-02-04 20:48:37
I agree with the fundamental idea that attention must be O(N^2), with the partial exception of the recent DeepSeek Sparse Attention (DSA) approach, which does not escape O(N^2) but tries to lower the constant factor enough that O(N^2) becomes acceptable, by adding a much faster layer that predicts high-scoring tokens.
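
Roughly how that plays out, as a minimal PyTorch sketch of the idea as I understand it (not DeepSeek's actual implementation; the function name, indexer dimension, and top_k value are my own illustrative assumptions): a cheap indexer still scores all N^2 pairs, but in a tiny dimension, and the expensive full attention only runs over the top-k tokens the indexer selects per query.

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, idx_q, idx_k, top_k=64):
    """q, k, v: (N, d) per-head tensors; idx_q, idx_k: (N, d_idx) with d_idx << d."""
    N, d = q.shape
    # Cheap indexer: still O(N^2) pairs, but in a tiny d_idx dimension,
    # so its constant factor is far below that of full attention.
    index_scores = idx_q @ idx_k.T                      # (N, N), cheap
    top_k = min(top_k, N)
    sel = index_scores.topk(top_k, dim=-1).indices      # (N, top_k) predicted high scorers
    # Expensive attention only over the selected tokens: O(N * top_k * d).
    k_sel = k[sel]                                      # (N, top_k, d)
    v_sel = v[sel]                                      # (N, top_k, d)
    scores = (q.unsqueeze(1) * k_sel).sum(-1) / d**0.5  # (N, top_k)
    weights = F.softmax(scores, dim=-1)
    return (weights.unsqueeze(-1) * v_sel).sum(1)       # (N, d)

# Usage: 4096 tokens, head dim 128, indexer dim 16 (all assumed for illustration).
N, d, d_idx = 4096, 128, 16
q, k, v = (torch.randn(N, d) for _ in range(3))
idx_q, idx_k = torch.randn(N, d_idx), torch.randn(N, d_idx)
out = sparse_attention_sketch(q, k, v, idx_q, idx_k)
print(out.shape)  # torch.Size([4096, 128])
```

The point of the sketch is the cost split: the quadratic part touches only the tiny indexer dimension, while the full-width dot products scale with top_k instead of N.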