I feel like n^10 attention could capture patterns that lower-complexity attention can't. So it seems arbitrary that we settled on n^2 attention.
Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one scores all n tokens (itself included), you get n * n = n^2 pairwise comparisons.
Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?
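To make the counting concrete, here's a minimal pure-Python sketch of plain softmax attention that also tallies how many scores it computes. The names `pairwise_attention` and `interaction_count` are made up for illustration. The "order-k" count is just one possible reading of what n^3 or n^10 attention could mean (each token scoring every tuple of k-1 tokens), not an established mechanism.

```python
import math

def pairwise_attention(queries, keys, values):
    """Plain softmax attention over lists of vectors.

    Returns the output vectors and the number of query-key scores computed.
    """
    score_count = 0
    outputs = []
    dim = len(values[0])
    for q in queries:
        # One score per (query, key) pair: this inner loop is where n * n comes from.
        scores = []
        for k in keys:
            dot = sum(qi * ki for qi, ki in zip(q, k))
            scores.append(dot / math.sqrt(len(q)))
            score_count += 1
        # Numerically stable softmax over this query's scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs, score_count

def interaction_count(n, order):
    """Hypothetical: scores needed if each token scored every (order - 1)-tuple of tokens."""
    return n ** order

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, count = pairwise_attention(tokens, tokens, tokens)
```

With 3 tokens the loop computes 3 * 3 = 9 scores. Under the tuple reading, `interaction_count(3, 3)` is 27 and `interaction_count(3, 10)` is 59049, so the cost blows up fast even for tiny n.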
(A sibling comment has the same interpretation as you, then hand-waves that transformers can emulate more complex systems.)