I feel like something like n^10 attention (scoring larger tuples of tokens at once, rather than just query-key pairs) could capture patterns that lower-complexity attention may not. So the choice of n^2 attention seems somewhat arbitrary.
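To make the complexity point concrete, here's a rough numpy sketch comparing standard pairwise attention with a hypothetical "third-order" attention that scores (query, key, key) triples, giving O(n^3) scores instead of O(n^2). The specific score and value combinations (elementwise products inside the einsums) are arbitrary choices for illustration, not an established design.

```python
import numpy as np

def pairwise_attention(X, Wq, Wk, Wv):
    """Standard attention: one score per (query, key) pair -> O(n^2) scores."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # shape (n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

def third_order_attention(X, Wq, Wk1, Wk2, Wv):
    """Hypothetical higher-order attention: one score per (query, key, key)
    triple -> O(n^3) scores, so a query can weight *pairs* of tokens jointly."""
    Q, K1, K2, V = X @ Wq, X @ Wk1, X @ Wk2, X @ Wv
    # scores[i, j, k] = <Q_i, K1_j * K2_k>  (one arbitrary choice of triple score)
    scores = np.einsum('id,jd,kd->ijk', Q, K1, K2) / K1.shape[-1]
    flat = scores.reshape(len(X), -1)                  # softmax over all (j, k) pairs
    w = np.exp(flat - flat.max(-1, keepdims=True))
    w = (w / w.sum(-1, keepdims=True)).reshape(scores.shape)
    # combine the two attended values per triple with an elementwise product
    return np.einsum('ijk,jd,kd->id', w, V, V)

n, d = 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W = lambda: rng.normal(size=(d, d)) / np.sqrt(d)
out2 = pairwise_attention(X, W(), W(), W())            # O(n^2) score tensor
out3 = third_order_attention(X, W(), W(), W(), W())    # O(n^3) score tensor
```

The point of the sketch is just the cost scaling: the triple-score tensor already blows up to n^3 entries, which is why higher-order variants like this are rarely used in practice even if they could, in principle, express richer interactions.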
Another thing to consider is that transformers are very general computers. You can encode many more complex architectures in simpler, multi-layer transformers.