I feel like n^10 attention can capture patterns that lower-complexity attention may not, so it seems arbitrary that we have n^2 attention.
Here is an illustrative example: you can write higher-order polynomials as a recursive chain of first-order polynomials (Horner's method).
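A minimal sketch of what I mean, just evaluating a small polynomial as a chain of first-order steps:

    def horner(coeffs, x):
        """Evaluate a polynomial at x with Horner's method.

        coeffs go from the highest-degree term down to the constant,
        e.g. [2, -3, 5] means 2*x**2 - 3*x + 5.
        """
        result = 0
        for c in coeffs:
            # Each step is only a first-order operation: result * x + c.
            result = result * x + c
        return result

    # 2*x**2 - 3*x + 5 at x = 4  ->  2*16 - 12 + 5 = 25
    print(horner([2, -3, 5], 4))  # 25

Each iteration is first order, but the chain as a whole computes the full higher-order polynomial.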
Things like TreeConnect [0] scale better if each TreeConnect layer has a depth of two and you add more TreeConnect layers to compensate for the lack of expressivity, instead of choosing a higher depth per layer.
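I'm going from memory on the details, so treat the shapes here as an assumption rather than the paper's exact implementation, but the depth-two idea is roughly: view the activations as a grid, mix along one axis with small per-group weights, then along the other, so two sparse stages stand in for one dense layer:

    import numpy as np

    # Hypothetical shapes: 256 activations viewed as a 16x16 grid.
    rows, cols = 16, 16
    x = np.random.randn(rows * cols)

    # Stage 1: a separate small weight matrix per row (sparse, locally connected).
    w1 = np.random.randn(rows, cols, cols) * 0.1
    h = np.einsum('rc,rcd->rd', x.reshape(rows, cols), w1)

    # Transpose so stage 2 mixes across the other axis.
    h = h.T
    w2 = np.random.randn(cols, rows, rows) * 0.1
    y = np.einsum('cr,crs->cs', h, w2).reshape(-1)

    # Two depth-one sparse stages use far fewer weights (2*16*16*16 = 8192)
    # than one 256x256 dense layer (65536); you stack more such layers to
    # recover expressivity instead of making any single layer deeper.
    print(y.shape)  # (256,)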
Attention pairs every token against every other token. n^10 would mean pairing each token with every combination of nine other tokens. The primary benefit of doing this is that you can have a "function" that accepts the interactions of 10 tokens as input to produce a single output, but you already have that with a ten-layer network. The interactions of two tokens can form a combined token that contains information from both tokens. The network can repeat this ten times to accumulate the desired information into a single super token and then make a decision based on all ten input tokens.
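To make that concrete, here is a toy sketch (plain NumPy; the combine function is just a stand-in I made up for one pairwise attention-style interaction) of folding ten tokens into one super token through repeated pairwise combination:

    import numpy as np

    def combine(a, b, w):
        # Stand-in for one pairwise interaction: mix two tokens into one.
        return np.tanh(w @ np.concatenate([a, b]))

    d = 8
    tokens = [np.random.randn(d) for _ in range(10)]

    # One pairwise-combination step per layer: each layer folds in one more
    # token, so a function of ten tokens needs only a chain of pairwise
    # (n^2-style) interactions, not a single ten-way (n^10-style) one.
    weights = [np.random.randn(d, 2 * d) * 0.1 for _ in range(9)]

    acc = tokens[0]
    for w, tok in zip(weights, tokens[1:]):
        acc = combine(acc, tok, w)

    # acc is the "super token" that now depends on all ten inputs.
    print(acc.shape)  # (8,)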