FlashAttention-T: Towards Tensorized Attention

>>matt_d+(OP)
OT but instead of quadratic attention can we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine that there might be some questions that may be worth paying higher cost for.

I feel like n^10 attention can capture patterns that lower complexity attention may not. So it seems arbitrary that we have n^2 attention.

>>simian+gp
n^2 isn't a setting someone chose, it's a mathematical consequence of what attention is.

Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.

Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?

(sibling comment has same interpretation as you, then handwaves transformers can emulate more complex systems)

>>refulg+3t
There are lots more complicated operations than comparing every token to every other token & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.

>>measur+mv
While you are correct at a higher level, comparing bigrams/trigrams would be less compute not more because there’s fewer of them in a given text

>>vlovic+ew
I'm correct on the technical level as well: https://chatgpt.com/s/t_698293481e308191838b4131c1b605f1

zlacker