zlacker

OT but instead of quadratic attention can we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine that there might be some questions that may be worth paying higher cost for.

I feel like n^10 attention can capture patterns that lower complexity attention may not. So it seems arbitrary that we have n^2 attention.

replies(7): >>eldenr+V1 >>refulg+N3 >>storus+X5 >>noosph+U6 >>jsenn+e7 >>crysta+Gb >>imtrin+S51

>>simian+(OP)
This is a common way of thinking. In practice this type of thing is more like optimizing flop allocation. Surely with an infinite compute and parameter budget you could have a better model with more intensive operations.

Another thing to consider is that transformers are very general computers. You can encode many many more complex architectures in simpler, multi layer transformers.

>>simian+(OP)
n^2 isn't a setting someone chose, it's a mathematical consequence of what attention is.

Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.

Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?

(sibling comment has same interpretation as you, then handwaves transformers can emulate more complex systems)

replies(1): >>measur+66

>>simian+(OP)
Aren't layers basically doing n^k attention? The attention block is n^2 because it allows 1 number per input/output pair. But nothing prevents you from stacking these on top of each other and get k-th order of "attentioness" with each layer encoding a different order.

>>refulg+N3
There are lots more complicated operations than comparing every token to every other token & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.

replies(2): >>vlovic+Y6 >>refulg+ba

>>simian+(OP)
Yes, and it works in theory.

Less so in practice. You saturate the memory of a b200 with a few dozen tokens on attentions higher than order 4. Training is even worse.

To paraphrase Knuth: high order polynomials are much more unimaginably large than mere infinity.

>>measur+66
While you are correct at a higher level, comparing bigrams/trigrams would be less compute not more because there’s fewer of them in a given text

replies(1): >>measur+qa

>>simian+(OP)
You can find papers discussing "cubic" attention, i.e. each token gets to interact with each pair of other tokens, but always in very theoretical settings with single-layer transformers on contrived synthetic tasks.

Keep in mind that LLMs have many many layers, so they have plenty of opportunity to model higher-order interactions without needing to brute force every possible combination of 10 previous tokens, of which the vast majority will be useless. Empirically, even full "quadratic" attention is not always necessary, as evidenced by the existence of linear/sparse attention variants that perform almost as well.

>>measur+66
That skips an important part: the "deep" in "deep learning".

Attention already composes across layers.

After layer 1, you're not comparing raw tokens anymore. You're comparing tokens-informed-by-their-context. By layer 20, you're effectively comparing rich representations that encode phrases, relationships, and abstract patterns. The "higher-order" stuff emerges from depth. This is the whole point of deep networks, and attention.

TL;DR for rest of comment: people have tried shallow-and-wide instead of deep, it doesn't work in practice. (rest of comment fleshes out search/ChatGPT prompt terms to look into to understand more of the technical stuff here)

A shallow network can approximate any function (universal approximation theorem), but it may need exponentially more neurons. Deep networks represent the same functions with way fewer parameters. There's formal work on "depth separation",functions that deep nets compute efficiently, but shallow nets need exponential width to match.

Empirically, People have tried shallow-and-wide vs. deep-and-narrow many times, across many domains. Deep wins consistently for the same parameter budget. This is part of why "deep learning" took off, the depth is load-bearing.

For transformers specifically, stacking attention layers is crucial. A single attention layer, even with more heads or bigger dimensions, doesn't match what you get from depth. The representations genuinely get richer in ways that width alone can't replicate.

>>vlovic+Y6
I'm correct on the technical level as well: https://chatgpt.com/s/t_698293481e308191838b4131c1b605f1

replies(1): >>refulg+0c

>>simian+(OP)
What you're missing is that there's no need to do extra work in the kernel smoothing step (what attention essentially is) because all the fancy transformation work is already happening in learning the kernel.

The feedforward networks prior to the attention layer are effectively learning sophisticated kernels. If you're unfamiliar (or for those who are) a Kernel is just a generalization of the dot product which is the most fundamental way of defining "similarity" between two points.

By learning a kernel the transformer is learning the best way to define what "similar" means for the task at hand and then we simply apply some basic smoothing over the data. This will handle all sort of interesting ways to compare points and that comparison will allow all points to provide a little bit of information.

Anything you could hope to achieve by performing more comparisons would be better solved by a better similarity function.

replies(1): >>shenbe+8y1

>>measur+qa
That math is for comparing all n-grams for all n <= N simultaneously, which isn't what was being discussed.

For any fixed n-gram size, the complexity is still O(N^2), same as standard attention.

replies(1): >>measur+Ok

>>refulg+0c
I was talking about all n-gram comparisons.

replies(1): >>refulg+Zn

>>measur+Ok
Thanks for clarifying. I was hoping to clarify the disconnect between you two, looked like on on "bigrams, trigrams, & so on." It reads idiomatically as enumerating fixed-n cases. Parsing "& so on" as "their simultaneous union" asks quite a bit of those two words. Either way, as ChatGPT showed you and you shared, all-ngram comparison brings us to O(N^3), still several exponents short of N^10 that started this thread.

replies(1): >>measur+Ao

>>refulg+Zn
This is getting tiresome. I can make the operations as complicated as necessary by comparing all possible permutations of the input string w/ every other permutation & that will not be reducible to standard attention comparisons. The n-gram was a simple example anyone should be able to understand. You can ask your favorite chatbot to compute the complexity for the permutation version.

replies(1): >>refulg+ov

>>measur+Ao
No worries! I enjoyed it fwiw, appreciate your time :) (The permutation version would be factorial, fwiw, not polynomial. Different beast entirely.)

>>simian+(OP)
The vast majority of benefits that can be obtained from scaling a single layer inside a neural network can often be better accomplished by having more layers instead.

Here is an illustrative example: You can write higher order polynomials as a recursive chain of first order polynomials. (Horner's method).

Things like TreeConnect [0] scale better if each TreeConnect layer has a depth of two and you add more TreeConnect layers to compensate the lack of expressivity instead of choosing a higher depth.

Attention pairs every token against every other token. n^10 would mean pairing each token with nine other tokens. The primary benefit of doing this is that you can have a "function" that accepts the interactions of 10 tokens as input to produce a single output, but you already have that if you have a ten layer network. The interactions of two tokens can form a combined token that contains information of both tokens. The network can repeat this ten times to accumulate the desired information into a single super token and then make a decision based on all ten input tokens.

[0] https://ieeexplore.ieee.org/document/8576141

>>crysta+Gb
There are two ingredients that don't fit in the "attention-is-kernel-smoothing" as far as I can tell: positional encoding and causal masking (another way to say positional encoding, I guess)

Also, Simplical attention is pretty much what the OP was going for, but the hardware lottery is such that it's gonna be pretty difficult to get competitive in terms of engineering, not that people aren't trying (e.g. https://arxiv.org/pdf/2507.02754)