imtrin (OP) | 2026-02-04 08:39:30
Most of the benefit you can get from scaling up a single layer inside a neural network can usually be obtained more effectively by adding more layers instead.

Here is an illustrative example: you can rewrite a higher-order polynomial as a nested chain of first-order polynomials (Horner's method).
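
A minimal sketch in plain Python of what that nesting looks like (the function name is just illustrative):

    def horner(coeffs, x):
        # coeffs = [a_n, ..., a_1, a_0], highest degree first
        # p(x) = (...(a_n*x + a_{n-1})*x + ...)*x + a_0
        acc = 0
        for a in coeffs:
            acc = acc * x + a   # one first-order step per coefficient
        return acc

    assert horner([3, 2, 1], 2) == 17   # 3*2^2 + 2*2 + 1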

Things like TreeConnect [0] scale better if each TreeConnect layer has a depth of two and you add more TreeConnect layers to compensate for the lost expressivity, instead of choosing a higher depth per layer.
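
A hedged NumPy sketch of that idea, not the paper's exact architecture: a depth-two layer applies two sparse per-group transforms with a regrouping in between, and you stack several such layers. All names and shapes here are my own illustration:

    import numpy as np

    def tree_connect_layer(x, w1, w2):
        # x: (s, s) -- s*s features arranged as s groups of s
        # w1, w2: (s, s, s) -- an independent weight matrix per group
        h = np.einsum('gk,gkj->gj', x, w1)     # sublayer 1: mix within groups
        h = h.T                                # regroup across groups
        return np.einsum('gk,gkj->gj', h, w2)  # sublayer 2: mix within groups

    s = 4
    rng = np.random.default_rng(0)
    x = rng.normal(size=(s, s))
    y = tree_connect_layer(x, *rng.normal(size=(2, s, s, s)))
    # Stacking more of these sparse depth-two layers restores expressivity.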

Attention pairs every token with every other token. n^10 would mean grouping each token with every combination of nine other tokens. The primary benefit of doing this is that you get a "function" that accepts the interactions of 10 tokens as input and produces a single output, but you already have that with a ten-layer network. The interaction of two tokens can form a combined token that contains information from both. The network can repeat this ten times to accumulate the desired information into a single super-token and then make a decision based on all ten input tokens.
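
A hedged NumPy sketch of that accumulation argument (shapes and initialization are arbitrary, not any particular implementation): each ordinary pairwise (n^2) attention layer lets a token fold in information from other tokens via a residual update, so ten stacked layers can combine ten tokens without any n^10 interaction:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention_layer(x, wq, wk, wv):
        # x: (n, d) tokens; ordinary pairwise (n^2) attention
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(x.shape[1])
        return x + softmax(scores) @ v   # residual: token absorbs others

    n, d, depth = 10, 16, 10
    rng = np.random.default_rng(1)
    x = rng.normal(size=(n, d))
    for _ in range(depth):               # ten pairwise layers instead of
        x = attention_layer(x, *(0.1 * rng.normal(size=(3, d, d))))
        # one 10-way interaction; outputs now depend on all ten inputs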

[0] https://ieeexplore.ieee.org/document/8576141
