My hunch is it emerges naturally out of the hierarchical generalization capabilities of multiple layer circuits. But then you need something to coordinate the acquired labels: a tweak on attention perhaps?
Another characteristic is probably some (limited) form of recursion, so the generalized labels emitted at the small end can be fed back in as tokens to be further processed at the big end.