I swear, the big reason models are black boxes is that we _want_ them to be. There's a clear mentality against people doing theory, and it shows in the results. I remember not too long ago Yi Tay (under @agihippo, though his main is @YiTayML) said "fuck theorists". I guess it's no surprise DeepMind recently hired him after that "get good" stuff.
Also, I'd like to point out that the author uses "we", but the paper has only one author on it. So may I suggest adding their cat as a coauthor? [0]
Because the synthetic linear languages are computationally/structurally simple, LLMs will, unlike humans, learn them just as easily as real human languages. Since this hierarchical (rather than linear) aspect of human language seems fundamental/important, LLMs are therefore not a good model of the human language faculty.
If you want to refute that claim, you would need to take synthetic language constructions similar to those used in the experiments and show that LLMs take longer to learn them.
Instead, you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do, except in an abstract sense).
The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.
Thanks for your feedback. I think our manipulations do establish that there are nontrivial inductive biases in Transformer language models and that these inductive biases are aligned with human language in important ways. There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't. It seems that GPT language models do favor real languages over the perturbed ones, and this shows that they have a simplicity bias that aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.
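To make the kind of manipulation concrete, here is a rough sketch of a deterministic shuffle, purely as an illustration and not the paper's actual procedure: the permutation below is a fixed function of a seed and the sentence length, so the perturbed language is perfectly systematic and learnable in principle, yet it scrambles the hierarchical structure of the original sentence. The function name and the length-based seeding are my own placeholders.

```python
import random

def deterministic_shuffle(tokens, seed=0):
    # Illustrative only: build a permutation that depends solely on
    # (seed, sentence length), so every sentence of the same length is
    # reordered the same way: deterministic, but structure-destroying.
    rng = random.Random(seed * 1_000_003 + len(tokens))
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order]

# Prints the same fixed permutation of these tokens on every run.
print(deterministic_shuffle("the cat that I saw ran away".split()))
```

Whether a learner finds such a language "simple" is exactly the empirical question; there's no a priori metric on which Moro's linear rules count as simple while this kind of shuffle counts as complex.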
Furthermore, this alignment is interesting even if it isn't perfect. I would be shocked if GPT language models happened to have inductive biases that perfectly match the structure of human language; why would they? But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do. As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits. Something similar is happening now with neural language models as models of language learning and processing, a very active research field. So I wouldn't say that neural language models can't shed any light on language simply because they're not a perfect match for a particular aspect of language.
As for using languages more directly based on the Moro experiments, we've discussed this extensively. There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set, where the control condition is a real language instead of a toy language, without introducing confounds of various kinds. We're open to suggestions. We've had very productive conversations with syntacticians about how to formulate new baselines in future work.
More generally, our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully, to the point that it can be used to test the inductive biases of neural models. It's not as simple as hierarchical vs. linear, since there are purely linear phenomena in syntax, such as Closest Conjunct Agreement, and morphophonological processes can act linearly across constituent boundaries, among other complications.
> The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.
I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences.
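To illustrate why, here's a toy calculation with made-up numbers (not the paper's), assuming for simplicity that all sentences have the same length and that corpus perplexity is computed the usual way, as the exponential of the mean per-token loss; the exact aggregation in the paper may differ. A transformation that touches only a small fraction of sentences barely moves the aggregate, even if it hurts those sentences a lot.

```python
import math

# Toy numbers, not from the paper: suppose only 5% of test sentences are
# altered by the Hop transformation, and those sentences incur double the
# per-token loss. (Assumes equal sentence lengths for simplicity.)
frac_affected = 0.05
nll_unaffected = 3.0   # hypothetical mean per-token negative log-likelihood
nll_affected = 6.0     # hypothetical mean per-token NLL on affected sentences

corpus_nll = (1 - frac_affected) * nll_unaffected + frac_affected * nll_affected
print(math.exp(nll_affected))    # ~403: perplexity on the affected sentences alone
print(math.exp(nll_unaffected))  # ~20:  perplexity if nothing were transformed
print(math.exp(corpus_nll))      # ~23:  corpus-level perplexity with 5% affected
```

So a near-identical headline number for NoHop vs. Hop is compatible with a substantial effect on the sentences the transformation actually changes.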