zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. simian+gp[view] [source] 2026-02-03 23:33:23
>>matt_d+(OP)
OT, but instead of quadratic attention can we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine there might be some questions that are worth paying a higher cost for.

I feel like n^10 attention can capture patterns that lower-complexity attention may not. So it seems arbitrary that we have n^2 attention.

◧◩
2. refulg+3t[view] [source] 2026-02-03 23:52:28
>>simian+gp
n^2 isn't a setting someone chose; it's a mathematical consequence of what attention is.

Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.

Put another way: n^2 is what you get when every token gets to look at every other token. What would n^3 be? n^10?
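
For concreteness, here's a minimal NumPy sketch of that counting (shapes, values, and variable names are illustrative, not from the paper): the score matrix has one entry per pair of tokens, and that n x n matrix is exactly where the n^2 comes from.

    import numpy as np

    n, d = 8, 16                    # sequence length, head dimension
    Q = np.random.randn(n, d)       # one query vector per token
    K = np.random.randn(n, d)       # one key vector per token
    V = np.random.randn(n, d)       # one value vector per token

    scores = Q @ K.T / np.sqrt(d)   # n x n: every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    out = weights @ V               # n x d output

    # scores/weights hold n * n entries, hence O(n^2) time and memory.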

(a sibling comment has the same interpretation as you, then handwaves that transformers can emulate more complex systems)

◧◩◪
3. measur+mv[view] [source] 2026-02-04 00:05:13
>>refulg+3t
There are lots of operations more complicated than comparing every token to every other token, & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.
◧◩◪◨
4. vlovic+ew[view] [source] 2026-02-04 00:09:40
>>measur+mv
While you are correct at a higher level, comparing bigrams/trigrams would be less compute, not more, because there are fewer of them in a given text.
◧◩◪◨⬒
5. measur+Gz[view] [source] 2026-02-04 00:31:09
>>vlovic+ew
I'm correct on the technical level as well: https://chatgpt.com/s/t_698293481e308191838b4131c1b605f1
◧◩◪◨⬒⬓
6. refulg+gB[view] [source] 2026-02-04 00:41:30
>>measur+Gz
That math is for comparing all n-grams for all n <= N simultaneously, which isn't what was being discussed.

For any fixed n-gram size, the complexity is still O(N^2), same as standard attention.
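
A rough counting sketch of that claim (toy numbers, my own illustration rather than anything from the linked calculation): for a fixed n-gram size k there are only ~N n-grams, so all-pairs comparison among them is still ~N^2.

    N, k = 1000, 3                  # sequence length, fixed n-gram size
    num_ngrams = N - k + 1          # ~N n-grams of size k
    comparisons = num_ngrams ** 2   # all-pairs comparison among them
    print(comparisons)              # 996004, i.e. on the order of N^2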

◧◩◪◨⬒⬓⬔
7. measur+4K[view] [source] 2026-02-04 01:39:15
>>refulg+gB
I was talking about all n-gram comparisons.
◧◩◪◨⬒⬓⬔⧯
8. refulg+fN[view] [source] 2026-02-04 02:01:04
>>measur+4K
Thanks for clarifying. I was hoping to pin down the disconnect between you two, which looked like it hinged on "bigrams, trigrams, & so on." That reads idiomatically as enumerating fixed-n cases; parsing "& so on" as "their simultaneous union" asks quite a bit of those two words. Either way, as the ChatGPT answer you shared shows, all-n-gram comparison brings us to O(N^3), still several exponents short of the N^10 that started this thread.
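
For the record, here's the counting I take that O(N^3) to mean (my own reading, not a quote from the linked answer): comparing same-size n-grams against each other, summed over every size k from 1 to N.

    # (N - k + 1)^2 comparisons at each size k, summed over k = 1..N
    N = 1000
    total = sum((N - k + 1) ** 2 for k in range(1, N + 1))
    assert total == N * (N + 1) * (2 * N + 1) // 6   # closed form, ~N^3 / 3
    print(total)                                      # 333833500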
◧◩◪◨⬒⬓⬔⧯▣
9. measur+QN[view] [source] 2026-02-04 02:06:38
>>refulg+fN
This is getting tiresome. I can make the operations as complicated as necessary by comparing all possible permutations of the input string w/ every other permutation, & that will not be reducible to standard attention comparisons. The n-gram case was a simple example anyone should be able to understand. You can ask your favorite chatbot to compute the complexity for the permutation version.
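
A quick back-of-the-envelope for that permutation version (my own counting, not verified against any chatbot output): an n-token string has n! permutations, so comparing every permutation with every other one gives (n!)^2 pairs before even accounting for the per-comparison cost, which is super-exponential and far beyond any fixed polynomial like N^10.

    from math import factorial

    n = 10
    pairs = factorial(n) ** 2       # all-pairs comparisons among the n! permutations
    print(pairs)                    # 13168189440000 for just 10 tokens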