I feel like n^10 attention can capture patterns that lower-complexity attention may not, so it seems arbitrary that we have n^2 attention.
Here is an illustrative example: you can write higher-order polynomials as a recursive chain of first-order polynomials (Horner's method).
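A minimal sketch of what I mean, just evaluating a small polynomial as a chain of first-order steps:

    def horner(coeffs, x):
        """Evaluate a polynomial at x with Horner's method.

        coeffs go from the highest-degree term down to the constant,
        e.g. [2, -3, 5] means 2*x**2 - 3*x + 5.
        """
        result = 0
        for c in coeffs:
            # Each step is only a first-order operation: result * x + c.
            result = result * x + c
        return result

    # 2*x**2 - 3*x + 5 at x = 4  ->  2*16 - 12 + 5 = 25
    print(horner([2, -3, 5], 4))  # 25

Each iteration is first order, but the chain as a whole computes the full higher-order polynomial.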
Things like TreeConnect [0] scale better if each TreeConnect layer has a depth of two and you add more TreeConnect layers to compensate for the lack of expressivity, instead of choosing a higher depth per layer.
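I'm going from memory on the details, so treat the shapes here as an assumption rather than the paper's exact implementation, but the depth-two idea is roughly: view the activations as a grid, mix along one axis with small per-group weights, then along the other, so two sparse stages stand in for one dense layer:

    import numpy as np

    # Hypothetical shapes: 256 activations viewed as a 16x16 grid.
    rows, cols = 16, 16
    x = np.random.randn(rows * cols)

    # Stage 1: a separate small weight matrix per row (sparse, locally connected).
    w1 = np.random.randn(rows, cols, cols) * 0.1
    h = np.einsum('rc,rcd->rd', x.reshape(rows, cols), w1)

    # Transpose so stage 2 mixes across the other axis.
    h = h.T
    w2 = np.random.randn(cols, rows, rows) * 0.1
    y = np.einsum('cr,crs->cs', h, w2).reshape(-1)

    # Two depth-one sparse stages use far fewer weights (2*16*16*16 = 8192)
    # than one 256x256 dense layer (65536); you stack more such layers to
    # recover expressivity instead of making any single layer deeper.
    print(y.shape)  # (256,)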
Attention pairs every token against every other token. n^10 would mean pairing each token with every combination of nine other tokens. The primary benefit of doing this is that you can have a "function" that accepts the interactions of 10 tokens as input to produce a single output, but you already have that with a ten-layer network. The interactions of two tokens can form a combined token that contains information from both tokens. The network can repeat this ten times to accumulate the desired information into a single super token and then make a decision based on all ten input tokens.
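To make that concrete, here is a toy sketch (plain NumPy; the combine function is just a stand-in I made up for one pairwise attention-style interaction) of folding ten tokens into one super token through repeated pairwise combination:

    import numpy as np

    def combine(a, b, w):
        # Stand-in for one pairwise interaction: mix two tokens into one.
        return np.tanh(w @ np.concatenate([a, b]))

    d = 8
    tokens = [np.random.randn(d) for _ in range(10)]

    # One pairwise-combination step per layer: each layer folds in one more
    # token, so a function of ten tokens needs only a chain of pairwise
    # (n^2-style) interactions, not a single ten-way (n^10-style) one.
    weights = [np.random.randn(d, 2 * d) * 0.1 for _ in range(9)]

    acc = tokens[0]
    for w, tok in zip(weights, tokens[1:]):
        acc = combine(acc, tok, w)

    # acc is the "super token" that now depends on all ten inputs.
    print(acc.shape)  # (8,)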