zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. simian+gp[view] [source] 2026-02-03 23:33:23
>>matt_d+(OP)
OT, but instead of quadratic attention, couldn't we have n^10 or something crazier? I feel like we're limiting the intelligence just to save cost, and I can imagine there are some questions that would be worth paying a higher cost for.

I feel like n^10 attention could capture patterns that lower-complexity attention can't, so stopping at n^2 seems arbitrary.
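
To make that concrete, here's a toy sketch of even just an order-3 version, where each query attends to pairs of keys instead of single keys. Everything here (the names q/k1/k2/v, the shapes, the normalization) is made up for illustration, not anything from the paper:

    import torch

    n, d = 64, 32                    # sequence length, head dim
    q  = torch.randn(n, d)
    k1 = torch.randn(n, d)
    k2 = torch.randn(n, d)
    v  = torch.randn(n, n, d)        # one value vector per (j, l) key pair

    # scores[i, j, l] = <q_i, k1_j * k2_l> / sqrt(d)  -> an n x n x n tensor
    scores = torch.einsum('id,jd,ld->ijl', q, k1, k2) / d ** 0.5
    attn = torch.softmax(scores.reshape(n, -1), dim=-1).reshape(n, n, n)
    out = torch.einsum('ijl,jld->id', attn, v)   # weighted sum back to (n, d)
    print(out.shape)                 # torch.Size([64, 32])

The score tensor is already n x n x n, and that's only order 3.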

2. noosph+aw[view] [source] 2026-02-04 00:09:04
>>simian+gp
Yes, and it works in theory.

Less so in practice. Order-k attention materializes an n^k score tensor, so on anything much past order 4 you saturate the memory of a B200 with a few dozen tokens. Training is even worse.
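
Back-of-the-envelope, assuming fp16 scores and roughly 192 GB of HBM on a B200 (both figures are my assumptions), and counting only the score tensor, i.e. ignoring activations, gradients and the KV cache:

    # Largest n whose order-k score tensor alone fits in HBM.
    HBM_BYTES = 192e9        # assumed B200 HBM capacity
    BYTES_PER_ENTRY = 2      # fp16 scores

    def max_tokens(order: int) -> int:
        return int((HBM_BYTES / BYTES_PER_ENTRY) ** (1.0 / order))

    for k in (2, 4, 6, 10):
        print(f"order {k:2d}: ~{max_tokens(k):,} tokens")
    # order  2: ~309,838 tokens
    # order  4: ~556 tokens
    # order  6: ~67 tokens
    # order 10: ~12 tokens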

To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
