I feel like n^10 attention could capture patterns that lower-complexity attention can't. So it seems arbitrary that we settled on n^2 attention.
Less arbitrary in practice. The score tensor for order-k attention has n^k entries, so you saturate the memory of a B200 with a few dozen tokens once you go much past order 4. Training is even worse, since you also have to keep those activations around for the backward pass.
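A quick back-of-envelope sketch of that claim, assuming 192 GB of HBM on a B200, fp16 scores, and counting only the single n^k score tensor (no batch, heads, activations, or gradients, so reality is strictly worse):

```python
# Largest context length n whose order-k attention score tensor
# (n**k entries) fits in GPU memory. Assumes 192 GB B200, fp16.

GPU_BYTES = 192e9    # B200 HBM capacity (assumed)
BYTES_PER_SCORE = 2  # fp16

for k in range(2, 9):
    # Solve n**k * BYTES_PER_SCORE <= GPU_BYTES for n
    n = int((GPU_BYTES / BYTES_PER_SCORE) ** (1 / k))
    print(f"order {k}: max context ~ {n:,} tokens")
```

This prints roughly 310k tokens at order 2, ~4,600 at order 3, ~560 at order 4, ~160 at order 5, and only ~37 at order 7, before batch size, heads, and the backward pass eat into it.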
To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.