That's reminding me of deep neural networks where single layer networks could achieve the same results, but the layer would have to be excessively large. Maybe we're re-using the same kind of improvement, scaling in length instead of width because of our computation limitations ?