I swear, the big reason models are black boxes is that we _want_ them to be. There's a clear anti-theory sentiment in the field, and the results of it show. I remember not too long ago Yi Tay (under @agihippo, but his main is @YiTayML) saying "fuck theorists". I guess it's no surprise DeepMind recently hired him after that "get good" stuff.
Also, I'd like to point out that the author uses "we", but the paper only has one author on it. So may I suggest adding their cat as a coauthor? [0]
I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading is that the experiments and evaluations are insufficient to support the conclusion drawn. I think the results are even consistent with Chomsky's claim. (I'll stick to the random-shuffle language for clarity.)
It does not appear that the evaluations consider all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). Higher perplexity is not necessarily indicative of poorer performance. I view this as analogous to comparing sequences of coin flips (our natural language) with sequences of dice rolls (our shuffle). One naturally has more randomness than the other. A model that successfully learns the former will have lower perplexity than a model that successfully learns the latter.
To evaluate properly, we need to consider whether the model is able to produce valid sentences, and consistently. With our coin-and-dice analogy, let's assume we have a sequence of 3 events. Our model conditions on a first flip of heads, and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. A successful model will tell us that the last 4 are not possible, but that the other 4 are equally likely. Now if we compare to dice rolls, conditioned on a first roll of 1, the model is not wrong for reporting higher entropy. That is exactly what we want the model to do; there are just more _valid_ answers. In the same way, if we're predicting the next token (conditionally), then we should expect higher perplexity on the "more impossible" languages, but that does not tell us how successfully the language was learned. (I would also expect these models to take longer to converge for the same reason, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as that phrase is ambiguous.)
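To make the analogy concrete, here is a toy calculation (my own sketch, not anything from the paper): two models that have each learned their "language" perfectly still end up with very different perplexities, purely because of the size of the valid set.

```python
import math

def perplexity(per_token_probs):
    """Average per-token perplexity: exp of the mean negative log-likelihood."""
    nll = -sum(math.log(p) for p in per_token_probs) / len(per_token_probs)
    return math.exp(nll)

# A "perfect" coin model: after conditioning on the first flip of heads,
# each remaining flip is H or T with probability 1/2.
coin_sequence_probs = [0.5, 0.5]      # two remaining flips
# A "perfect" dice model: each remaining roll is one of six equally likely faces.
dice_sequence_probs = [1/6, 1/6]      # two remaining rolls

print(perplexity(coin_sequence_probs))  # 2.0
print(perplexity(dice_sequence_probs))  # ~6.0 -- higher, yet the model is just as "correct"
```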
Raw entropy isn't enough; the metric needs to account for the entropy of the underlying data distribution itself. To compare the languages against one another, we'd have to normalize each model's perplexity by the entropy of its own distribution. A direct comparison will always show the random-shuffle model with higher perplexity (just as with coins and dice), so it is an unfair comparison. Without that normalization, we'd expect to find exactly what is shown in Figure 2.
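One possible way to do the normalization I'm suggesting (again my own sketch, not the paper's evaluation): subtract the entropy of the data distribution from the model's cross-entropy, i.e. report the excess (the KL divergence) instead of raw perplexity. Under that metric, both perfect toy models score identically.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); equals the entropy H(p) when q == p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# True next-token distributions for the two toy "languages".
coin_p = [0.5, 0.5]
dice_p = [1/6] * 6

# Suppose each trained model recovers its true distribution exactly.
coin_q = coin_p
dice_q = dice_p

for name, p, q in [("coin", coin_p, coin_q), ("dice", dice_p, dice_q)]:
    h_pq = cross_entropy(p, q)   # what perplexity is built from
    h_p = cross_entropy(p, p)    # entropy of the data itself
    print(name, "excess (KL):", h_pq - h_p)  # 0.0 for both: learned equally well
```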
As I understand the writing and the code, you do not compare against all valid next tokens, only the fixed ones: the perplexities look to be computed in the usual way (I see a loop over the batch, but not over valid permutations). I also see this line in the text:
> dataset shuffling during training.
So I take that to mean the dataloader is shuffling the tokens of each selected sentence on the fly? I don't see this in the code, but I'm happy to trust you if you say yes. The code makes me think the perturbations were generated beforehand (I'm having dependency issues, so I can't verify). If they were generated beforehand, then I think the results don't support the conclusion, because you haven't been implicitly teaching the model that ordering doesn't matter; you've taught it one fixed reordering (see the sketch below for the distinction I mean).

The fact that results get worse for the models without positional encodings is suggestive of a problem here. If position does not matter, why does positional information improve the model's ability to learn? It should be irrelevant to a non-deterministically shuffled language. I'm also suspicious that the "no shuffle" model appears to have identical learning behavior in Figures 2 and 6. (I also see a lot of references to error bars, but it isn't clear to me what the variance is. Are the bars smaller than the markers? Rescaling would really help here, as would placing horizontal bars at the bounds, given how the markers render in the legends.)

As for limitations, I also suspect there's a bias introduced by tokenization, since the tokenizer and its embeddings are built from text in the original (expected) ordering. I think this adds complexity that could be reduced, but not eliminated, by shuffling words instead of tokens. Not eliminated, because tokens don't depend only on single words but on the surrounding sentences: word pairs and sequences matter.
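For concreteness, here is the distinction I'm asking about (a sketch with hypothetical names, not the paper's code): shuffling inside the dataloader means the model sees a fresh permutation of each sentence every epoch, whereas a precomputed perturbation fixes one reordering, which is just a different deterministic language.

```python
import random
import torch
from torch.utils.data import Dataset

class OnTheFlyShuffle(Dataset):
    """Re-shuffles each sequence every time it is drawn, so the model sees a
    different permutation of the same sentence across epochs.
    (Hypothetical: `sequences` is assumed to be a list of token-id lists.)"""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        tokens = list(self.sequences[idx])
        random.shuffle(tokens)  # fresh permutation on every access
        return torch.tensor(tokens)

# A precomputed perturbation would instead shuffle once, up front, and then
# serve that same fixed ordering every epoch -- teaching one reordered
# language rather than an order-free one.
```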
Fwiw, I don't agree with Chomsky. LLMs are clearly extracting structure from language, and I think it is obtuse to claim that a system designed for pattern matching won't identify those patterns. One doesn't need reasoning or abstraction to converge to this; one simply needs sufficient sampling and for structure to exist. Language clearly has structure, so we should expect a sufficiently powerful pattern matcher to extract it.
Chomsky has never said that LLMs can't extract patterns from language. His point is that humans have trouble processing certain language patterns while LLMs don't, which means that LLMs work differently and therefore can't shed any light on humans.