I agree with the fundamental idea that attention must be O(N^2). The recent DeepSeek sparse attention approach (DSA) is a partial exception: it doesn't escape N^2 outright, but it pushes the constant factor down far enough that the quadratic cost becomes much more acceptable, by adding a much cheaper layer that predicts which tokens will score highly, so full attention only needs to be computed over those.
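
A rough sketch of that idea (my own simplification, not DeepSeek's actual code; names like `q_idx`, `k_idx`, and `top_k` are illustrative): a lightweight indexer still scores all N^2 pairs, but with a tiny per-pair cost, and the expensive softmax-plus-value aggregation then runs only over the top-k keys it selects for each query.

```python
# Hypothetical sketch of "cheap indexer + sparse main attention".
# Not DeepSeek's implementation; shapes and names are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, q_idx, k_idx, top_k=64):
    """Q, K, V: (N, d) projections for the main attention.
    q_idx, k_idx: (N, d_idx) much smaller projections for the cheap indexer."""
    N, d = Q.shape

    # Indexer pass: still N x N scores, but with a tiny head dimension,
    # so the constant in front of N^2 is small.
    index_scores = q_idx @ k_idx.T                          # (N, N)

    # Causal mask: a token may only attend to itself and earlier tokens.
    causal = np.tril(np.ones((N, N), dtype=bool))
    index_scores = np.where(causal, index_scores, -np.inf)

    # For each query, keep only the top_k highest-scoring key positions.
    k_eff = min(top_k, N)
    topk = np.argpartition(-index_scores, k_eff - 1, axis=1)[:, :k_eff]

    # Main attention: O(N * k * d) instead of O(N^2 * d).
    out = np.empty_like(Q)
    for i in range(N):
        sel = topk[i]
        sel = sel[index_scores[i, sel] > -np.inf]           # drop masked picks
        att = softmax((Q[i] @ K[sel].T) / np.sqrt(d))       # (k,)
        out[i] = att @ V[sel]
    return out
```

The point of the split is that the indexer stays quadratic but nearly free (tiny head dimension, no softmax over values), while the heavy part of attention drops from O(N^2·d) to roughly O(N·k·d).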