zlacker

[parent] [thread] 10 comments
1. canjob+(OP)[view] [source] 2024-11-29 22:22:15
What's your objection to Mission Impossible Language Models?
replies(2): >>godels+mH >>foobar+1l3
2. godels+mH[view] [source] 2024-11-30 08:20:42
>>canjob+(OP)
I see you're one of the authors.

I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading is that the experiments and evaluations are insufficient to support the conclusions drawn. I think the results are even consistent with Chomsky's claim. (I'll stick to the random shuffle for clarity.)

It does not appear that the evaluations consider all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). Higher perplexity is not necessarily indicative of poorer performance. I view this as analogous to comparing sequences of coin flips (our natural language) with sequences of dice rolls (our shuffle): one naturally has more randomness than the other, and a model that successfully learns the former will have lower perplexity than a model that successfully learns the latter.

To properly evaluate, we need to consider whether the model is able to produce valid sentences, and whether it does so consistently. With our coin-and-dice analogy, let's assume we have a sequence of 3 events. Our model conditions on a first flip of heads, and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. A successful model will tell us that the last 4 are not possible, but that the others are equally likely. Now if we compare to dice rolls, conditioned on a first roll of 1, the model is not wrong for reporting higher entropy; that is exactly what we want our model to do. There are just more _valid_ answers. In the same way, if we're predicting the next token (conditionally), we should expect higher perplexity in the "more impossible" languages, but that does not tell us how successfully the language was learned. (I would also expect these models to take longer to converge for the same reason, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as that phrase is ambiguous.)

Entropy alone isn't enough; our metric needs to account for the underlying probability mass. To compare models against one another, we'd have to normalize the values to their respective distributions. A direct comparison will always show the random-shuffle model with higher perplexity (just as with coins and dice), so it is an unfair comparison. Without that normalization we'd expect to find exactly what is shown in Figure 2.
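
To make the coin/dice point concrete, here's a toy calculation (purely illustrative, nothing to do with the paper's actual evaluation code):

  import math

  def entropy_bits(probs):
      """Shannon entropy (bits) of a discrete distribution."""
      return -sum(p * math.log2(p) for p in probs if p > 0)

  # A *perfect* model of a fair coin: given any prefix, the next flip is
  # uniform over 2 outcomes.
  coin_next = [1/2] * 2
  # A *perfect* model of a fair die: given any prefix, the next roll is
  # uniform over 6 outcomes.
  dice_next = [1/6] * 6

  print(entropy_bits(coin_next), 2 ** entropy_bits(coin_next))  # 1.0 bit,   perplexity 2
  print(entropy_bits(dice_next), 2 ** entropy_bits(dice_next))  # ~2.58 bits, perplexity 6

Both models are perfect, yet their perplexities differ by a factor of 3; comparing the raw values says nothing about which one learned its process better.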

As I understand the writing and the code, you do not compare against all valid tokens, but only against the fixed ones. I'm just seeing the perplexities computed in the usual way (I see a loop over the batch, but not over valid permutations). I see this line in the text:

  > dataset shuffling during training.
So I assume this means the dataloader is shuffling the selected sentences? I don't see this in the code, but I'm happy to trust you if you say yes. The code makes me think the perturbations were generated beforehand (I'm having dependency issues so I can't verify). But if you are generating the perturbations beforehand, then I think the results are irrelevant, because you haven't been implicitly teaching the model that ordering doesn't matter. The fact that results get worse for the models without positional encoding is suggestive of a problem here: if position does not matter, why does positional information increase the model's ability to learn? It should be irrelevant to a nondeterministically shuffled language. I am also suspicious that the "no shuffle" model appears to have identical learning capabilities w.r.t. Figs. 2 and 6. (I'm also seeing a lot of references to error bars, but it isn't clear to me what the variance is. Are the bars smaller than the markers? Rescaling could really help here, as would placing horizontal bars at the bounds, given how the markers are rendered in the legends.)

As for limitations, I also suspect there's a bias introduced by tokenization, since the token vocabulary and embeddings are generated from the expected (original) ordering. I think this adds additional complexity that could be reduced, though not eliminated, by shuffling words instead of tokens. Not eliminated, because tokens depend not only on single words but on the sentences themselves: word pairs and sequences matter.
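
Roughly what I mean, as a toy sketch (hypothetical token splits and made-up data, not the paper's tokenizer or code):

  import random

  # Hypothetical pre-tokenized sentence: word-internal pieces such as
  # "books"/"he"/"lf" only carry their intended meaning in order.
  sentence = [["the"], ["books", "he", "lf"], ["fell"], ["."]]
  rng = random.Random(0)

  # Token-level shuffle: word-internal pieces get scattered, so "he" can
  # no longer be distinguished from the pronoun "he" by its neighbors.
  tokens = [t for word in sentence for t in word]
  token_shuffled = tokens[:]
  rng.shuffle(token_shuffled)

  # Word-level shuffle: word order is destroyed, but each word's pieces
  # stay contiguous, so within-word token statistics are preserved.
  words = sentence[:]
  rng.shuffle(words)
  word_shuffled = [t for w in words for t in w]

  print(token_shuffled)
  print(word_shuffled)

The word-level version still destroys ordering, but it doesn't also scramble the statistics of word-internal pieces.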

Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.

replies(2): >>canjob+7d1 >>foobar+1o3
3. canjob+7d1[view] [source] [discussion] 2024-11-30 16:13:54
>>godels+mH
Thanks for the feedback! The point about perplexity is totally valid for the nondeterministic shuffle baseline. This seems to have misled a lot of people. But for all the other baselines, we're applying some one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the distribution being learned.

As for tokenization, good point: it's worth retokenizing based on the altered datasets to see if that changes anything. I think it might not, because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.

replies(1): >>godels+zA1
4. godels+zA1[view] [source] [discussion] 2024-11-30 20:17:03
>>canjob+7d1
Thanks for the reply!

  > one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the _distribution being learned_.
I disagree. Entropy of the model? Of the language? Of the sentence? Of the tokens? The distinction is subtle but important. A one-to-one mapping is not structure preserving. For a trivial example: {a,b,c} -> {c,a,b} doesn't preserve alphabetical ordering. The distribution the LM is learning is the intractable(?) distribution of the language itself. Certainly the entropy here changes, and I believe your results demonstrate this. I think the entropy would only be the same if we preserved all structure[0], but my understanding of the impossible languages is that they remove (all) structure. I'm not sure if that'd yield worthwhile results, but I think it could be a good sanity check -- or at least an assumption check -- to do deterministic permutations based on syntactic structure, e.g. replacing S,V,O -> O,V,S for all sentences.

  > because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
Maybe I was misled by Table 1. Note that "bookshelf" -> {books,he,lf}; that isn't whitespace delimited. The examples only show "bookshelf" being preserved in the Shuffle (s=21) and HOP cases (well... split by the hop token), with the Partial reverse case showing {books, [R], ., lf, he}. I think this is a good example of token bias, since the token "he" can hold multiple meanings, and I suspect the statistics change when we consider word structure and how these tokens appear conditionally. This is also a case where I think Figure 2 doesn't do a great job of supporting the conclusion: the differences are small, so are they completely offset by this bias? The HOP case seems even more convincing, since the results converge and this bias should be much more easily accounted for there. What is unclear to me is why there's a significant difference at the beginning of training; I am a bit surprised that training from scratch would show this initialization bias.

(Also, just to note: I wouldn't reject this paper if it came across my desk, though I do bias towards accepting works. I'm picking on you because you won best paper and because of the narrative that formed around it.)

[0] I think we have "natural experiments" here w.r.t. learning other languages and translation, though not all structure is preserved: some is lost and some is gained. But this again can be affected by the tokenizer and by whether you are doing things like stripping accents. Clearly that removes structure, but it isn't going to have a big effect on English.

replies(1): >>canjob+cZ1
5. canjob+cZ1[view] [source] [discussion] 2024-12-01 01:00:47
>>godels+zA1
> A one-to-one mapping is not structure preserving.

That's just the point we were making: if you mess with the structure of natural language, then language models don't learn it as well any more.

The transformations do preserve the entropy in the sense that the lowest achievable perplexity is the same for everything except the nondeterministic shuffle, since applying a one-to-one function preserves the entropy of a discrete random variable -- in this case, the random variable over documents in the training and test sets. In principle, a universal approximator should be able to learn to invert any one-to-one transformation we apply, although of course in practice a GPT architecture doesn't achieve this.

The transformations do mess up the local entropy of strings, but I think that's part of the point: human language seems to be structured so that things are locally predictable, and when you screw up that structure, the language becomes harder to learn.
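
To illustrate the distinction with a toy sketch (made-up data, not our actual code):

  import math
  from collections import Counter

  def H(counts):
      """Shannon entropy (bits) of an empirical distribution."""
      total = sum(counts.values())
      return -sum(c / total * math.log2(c / total) for c in counts.values())

  docs = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]

  # A one-to-one transformation of whole documents (here: reversing the
  # word order) is just a relabeling, so the entropy of the random
  # variable "which document did we draw" is unchanged.
  T = lambda d: " ".join(reversed(d.split()))
  assert H(Counter(docs)) == H(Counter(T(d) for d in docs))

  # But the local, next-word statistics that an autoregressive model
  # actually fits are different before and after the transformation.
  def bigrams(ds):
      return Counter(b for d in ds for b in zip(d.split(), d.split()[1:]))

  print(bigrams(docs))
  print(bigrams([T(d) for d in docs]))

The lowest achievable perplexity is governed by the first quantity; what changes, and what we think makes the transformed languages harder to learn, is the second.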

We are working on a followup with more transformations, including syntactic ones, as you might imagine. It's surprisingly hard to come up with manipulations that target specific aspects of linguistic structure while fitting the criteria that (1) they don't affect the lowest achievable perplexity, (2) they clearly change the language from "possible" to "impossible" in a way that all or most linguists would agree with, and (3) they can actually be implemented using the data we have---for example, a transformation that relies on detailed syntactic parses would require parsing the whole dataset, which is not only time consuming but also introduces possible confounds from parser errors. We're talking to a lot of people; if you have ideas we'd be happy to hear them!

replies(2): >>foobar+Ym3 >>godels+vJ3
6. foobar+1l3[view] [source] 2024-12-01 20:22:49
>>canjob+(OP)
The real problem with the paper is not any of the mathematical details that others have described; it is more fundamental. Chomsky's claim is that humans have a distinctive property: they seem unable to process certain synthetic language constructions --- namely linear (non-hierarchical) languages --- as well as they process synthetic human-like (hierarchical) languages, and they use a different part of the brain to do so. This was shown in experiments (see Moro, Secrets of Words; I think his Nature paper also cites the studies).

Because the synthetic linear languages are computationally/structurally simple, LLMs will, unlike humans, learn them just as easily as real human languages. Since this hierarchical aspect of human language seems fundamental/important, LLMs are therefore not a good model of the human language faculty.

If you want to refute that claim, you would take synthetic language constructions similar to those used in the experiments and show that LLMs take longer to learn them.

Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).

The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.

replies(1): >>canjob+dU5
7. foobar+Ym3[view] [source] [discussion] 2024-12-01 20:44:59
>>canjob+cZ1
As I said in another comment, the only relevant synthetic languages that would refute Chomsky's claim are the ones we have human experiments for -- specifically those of Moro.

I believe the relevant papers are referenced here on page 4. (Tettamanti et al., 2002; Musso et al., 2003; Moro, 2016)

https://acesin.letras.ufrj.br/wp-content/uploads/2024/02/Mor...

8. foobar+1o3[view] [source] [discussion] 2024-12-01 20:54:59
>>godels+mH
> Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.

Chomsky has never said that LLMs can't extract patterns from language. His point is that humans have trouble processing certain language patterns while LLMs don't, which means that LLMs work differently and therefore can't shed any light on humans.

9. godels+vJ3[view] [source] [discussion] 2024-12-02 00:48:23
>>canjob+cZ1
I was going to write more, but I want to keep this comment simple. I think we agree more than we disagree and that I've not communicated effectively, so I want to focus on my main point about the metric (and one other part).

In the intro when you reference Chomsky it says

  | [Chomsky] make very broad claims to the effect that large language models (LLMs) are equally capable of learning possible and impossible human languages.
My objection here is to how the success of learning a language is measured, not to the difficulty of the learning process.

What I'm trying to say is that in the shuffled languages, when we do next-token prediction, the perplexity is necessarily higher. This must be true because of the destruction of local structure. BUT this does not mean that the language wasn't learned successfully. For example, in the next-token setting, if we are predicting the sentence "My name is godelski", then certainly the perplexity is higher when "Name godelski my is", "godelski name is my", etc. are also valid sequences. The perplexity is higher BUT the language is successfully learned.
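
A quick back-of-the-envelope version of this (purely illustrative arithmetic, not anything from the paper):

  import math

  # For a nondeterministically shuffled sentence of n distinct words,
  # every one of the n! orderings is a valid continuation of the
  # "language", so even an oracle that has learned the language exactly
  # cannot do better than a per-token perplexity of (n!) ** (1/n).
  def oracle_perplexity(n_words):
      return math.factorial(n_words) ** (1.0 / n_words)

  print(oracle_perplexity(4))   # "My name is godelski": ~2.21
  print(oracle_perplexity(10))  # a 10-word sentence:     ~4.53

  # The unshuffled sentence, once learned, can be predicted with
  # perplexity 1. The gap measures how many answers are valid, not how
  # well the language was learned.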

My point is that we have to be careful about how we define what it means for the language to be successfully learned.

(I'm not sure there is a tractable measure for the success of learning a language; I don't know of a proof in either direction. But I do know that perplexity is a proxy, and we must be careful about how well any proxy aligns with what we actually care about, as there is always a difference. Even seemingly trivial things can't be measured directly; measurement is always indirect. E.g., we measure distance with a ruler, which is an approximation based on the definition of a meter but is not a meter itself, though in that case the alignment is typically very good.)

  > a universal approximator should be able to learn to invert any one-to-one transformation we apply
I agree, but I think we need to be a bit careful here. Extending the universal approximation theorem to discrete distributions introduces new conditions; the usual form requires that the function being approximated be continuous on a closed and bounded domain. But I think we also need to be careful with how we look at complexity. Yes, a bijective function has the same complexity in both directions, but this will not hold if there is any filtering. The part where we really have to be careful, though, is the difference in difficulty between _learning_ D and _learning_ T(D). Even if T is simple, these learning processes may not be equally simple. It's subtle but, I believe, important. As a clear example, we often scale data (one might call this normalization, but let's not be ambiguous; let's just make sure it is bijective), and it is clear that learning from the scaled data is easier than learning from the unscaled data. So while a universal approximator is __capable__ of learning to invert any bijection, this does not mean that __learning__ to invert it is easy.
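
As a concrete toy version of the scaling example (a sketch with made-up data and plain gradient descent, nothing from the paper):

  import numpy as np

  # y depends on two features, one of which is on a wildly different
  # scale. Rescaling the features is a bijection, so no information is
  # gained or lost, yet gradient descent finds the rescaled problem far
  # easier because the loss surface is better conditioned.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
  y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)

  def gd_mse(X, y, lr, steps=500):
      w = np.zeros(X.shape[1])
      for _ in range(steps):
          w -= lr * (X.T @ (X @ w - y)) / len(y)
      return np.mean((X @ w - y) ** 2)

  X_scaled = X / X.std(axis=0)         # bijective per-feature rescaling
  print(gd_mse(X, y, lr=1e-7))         # barely learns the small-scale feature: stable step size is tiny
  print(gd_mse(X_scaled, y, lr=1e-1))  # converges to roughly the noise floor

Same information either way; the bijection only changes how hard the learning problem is.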

I do really appreciate the chat and the explanations. I'm glad to know that there is a followup and I'm interested to see the results.

10. canjob+dU5[view] [source] [discussion] 2024-12-02 21:03:30
>>foobar+1l3
> Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).

Thanks for your feedback. I think our manipulations do establish that there are nontrivial inductive biases in Transformer language models and that these inductive biases are aligned with human language in important ways. There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't. It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.

Furthermore, this alignment is interesting even if it isn't perfect. I would be shocked if GPT language models happened to have inductive biases that perfectly match the structure of human language---why would they? But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do. As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits. Something similar is happening now with neural language models as models of language learning and processing, a very active research field. So I wouldn't say that neural language models can't shed any light on language simply because they're not a perfect match for a particular aspect of language.

As for using languages more directly based on the Moro experiments, we've discussed this extensively. There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set, where the control condition is a real language instead of a toy language, without introducing confounds of various kinds. We're open to suggestions. We've had very productive conversations with syntacticians about how to formulate new baselines in future work.

More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully, to the point that it can be used to test the inductive biases of neural models. It's not as simple as hierarchical vs. linear, since there are purely linear phenomena in syntax such as Closest Conjunct Agreement, and morphophonological processes can also act linearly across constituent boundaries, among other complications.

> The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.

I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences.

replies(1): >>foobar+bP7
11. foobar+bP7[view] [source] [discussion] 2024-12-03 16:12:03
>>canjob+dU5
> these inductive biases are aligned with human language in important ways.

They aren’t, which is the entire point of this conversation, and simply asserting otherwise isn’t an argument.

> It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.

This is a nonsensical argument: consider if you had studied a made-up language that required you to factor numbers or do something else inherently computationally expensive. LLMs would show a simplicity bias “just like humans”, but it’s obvious this doesn’t tell you anything, and specifically doesn’t tell you that LLMs are like humans in any useful sense.

> There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't.

You are missing the point, which is that humans cannot learn the Moro languages as easily as LLMs can. Therefore LLMs differ from humans in a fundamental way. This difference is so fundamental that you need to give strong, specific, explicit justification for why LLMs are useful in explaining humans. The only reason I used the word “simple” was to argue that LLMs would be able to learn them easily (without even having to run an experiment), but the same would be true if LLMs learned a non-simple language that humans couldn’t.

Again, it doesn’t matter if you find all the ways that humans and LLMs are the same --- for example, that they both struggle with shuffled sentences or with a language that involves factoring numbers --- what matters is that there exists a fundamental difference between them, exemplified by the Moro languages.

> But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do.

Why? There is no reason to believe you will learn anything from it. This is a bizarre, abstract argument that doing something is useful because you might learn something from it; you can say that about anything you do. There is a video on YouTube where Chomsky engages with someone making similar arguments about chess computers. Chomsky said there wasn’t any self-evident reason why studying chess-playing computers would tell you anything about humans. He was correct; we never did learn anything significant about humans from chess computers.

> As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits.

There is a difference between pursuing a reasonable line of inquiry and having it fail, versus pursuing one that you know, or ought to know, is flawed. If someone had pointed out the problems with CFGs at the outset, it would have been foolish to pursue them, just as it is foolish to ignore the Moro problem now.

> There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set

I can’t imagine what those challenges are. I don’t remember the details, but I believe Moro made systematic, simple grammar changes; your Hop is in the same vein.

> where the control condition is a real language

Why does the control need to be a real language? Moro did not use a real-language control on humans. (Edit: is it because you want to use pre-trained models?)

> More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully

Again, you’ve invented an abstract problem to study that has no bearing on the problem Chomsky has described. Moro showed that humans struggle with certain synthetic grammar constructions; Chomsky noted that LLMs do not share this important feature. You are now taking this concrete observation about humans and turning it into the abstract study of “impossible languages”.

> It's not as simple as hierarchical vs. linear

There are different aspects of language but there is a characteristic feature missing from LLMs which makes them unsuitable as models for human language. It doesn’t make any sense for a linguist to care about LLMs unless you provide justification for why they would learn anything about the human language faculty from LLMs despite that fundamental difference.

> I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences

Even if this were true, we return to “no evidence” rather than “evidence against”. But it is very unlikely that the Moro languages are any more difficult for LLMs to learn because, as I said earlier, they are computationally very simple, simpler than hierarchical languages.
