zlacker

[return to "A statistical approach to model evaluations"]
1. fnordp+Are[view] [source] 2024-11-29 18:56:21
>>RobinH+(OP)
This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it's helpful?
◧◩
2. godels+5Ke[view] [source] 2024-11-29 21:26:36
>>fnordp+Are
As an ML researcher who started in physics (this seems common among physics/math-turned-ML people, Evan included), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking what people's covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but you should. (I guess this explains why Mission Impossible Language Models won best paper...)

I swear, the big reason models are black boxes is because we _want_ them to be. There's a clear sentiment against people doing theory, and the results of it show. I remember not too long ago Yi Tay (under @agihippo, but his main account is @YiTayML) said "fuck theorists". I guess it's not a surprise DeepMind recently hired him after that "get good" stuff.

Also, I'd like to point out that the author uses "we" but the paper only has one author. So may I suggest adding their cat as a coauthor? [0]

[0] https://en.wikipedia.org/wiki/F._D._C._Willard

◧◩◪
3. canjob+yQe[view] [source] 2024-11-29 22:22:15
>>godels+5Ke
What's your objection to Mission Impossible Language Models?
◧◩◪◨
4. godels+Uxf[view] [source] 2024-11-30 08:20:42
>>canjob+yQe
I see you're one of the authors.

I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading is that the experiments and evaluations are insufficient to support the conclusion drawn. I think the results are even consistent with Chomsky's claim. (I'll stick to the random shuffle for clarity.)

It does not appear that the evaluations are considering all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). A higher perplexity is not necessarily indicative of poorer performance. I view this as analogous to comparing sequences of coin flips (our natural language) with sequences of dice rolls (our shuffle). One naturally has more randomness than the other. A model that successfully learns the former will have lower perplexity than a model that learns the latter.

To evaluate properly, we need to consider whether the model is able to produce valid sentences, and consistently. With our coin and dice analogy, let's assume we have a sequence of 3 events. Our model conditions on a single flip of heads, and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. A successful model will tell us that the last 4 are not possible, but that the others are equally likely. Now if we compare to a die roll, conditioned on a roll of 1, then the model is not wrong to report higher entropy. That is exactly what we want it to do: there are just more _valid_ answers. In the same way, if we're predicting the next token (conditionally), then we should expect a higher perplexity in the "more impossible" languages, but that does not tell us how successfully the language was learned (I would also expect these models to take longer to converge because of this, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as it is ambiguous).
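
To make that concrete, here's a toy sketch (my own numbers and function name, nothing from the paper): a *perfect* model of each process, conditioned on the first outcome, still shows higher perplexity for the die simply because more continuations are valid.

  import math

  # Toy illustration: with a perfect model, perplexity is just the number of
  # equally likely valid continuations; it says nothing about learning quality.
  def perfect_model_perplexity(num_valid_continuations: int) -> float:
      entropy_bits = math.log2(num_valid_continuations)  # H of a uniform choice
      return 2.0 ** entropy_bits                         # perplexity = 2^H

  print(perfect_model_perplexity(4))   # coin, given H: {HHH, HHT, HTH, HTT} -> 4.0
  print(perfect_model_perplexity(36))  # die, given a 1: 36 valid two-roll continuations -> ~36.0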

Entropy isn't enough; our metric needs to be based on the probability mass distribution. To compare the models against one another, we'd have to normalize the values to their respective distributions. A direct comparison will always leave the random shuffle model with higher perplexity (just as with coins and dice), so it is an unfair comparison. Without the normalization, we'd expect to find exactly what is shown in Figure 2.
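
One crude way to do that normalization (a sketch of the idea with made-up numbers, not a proposal for the paper's exact metric): measure each model against the entropy floor of its own process, e.g. as excess bits, rather than comparing raw perplexities.

  import math

  # Excess bits above the floor: 0.0 means the model matched its own process
  # perfectly, regardless of how random that process is.
  def excess_bits(model_cross_entropy_bits: float, floor_bits: float) -> float:
      return model_cross_entropy_bits - floor_bits

  print(excess_bits(math.log2(4.5), math.log2(4)))    # "coin" model: ~0.17 bits over its floor
  print(excess_bits(math.log2(40.0), math.log2(36)))  # "die" model:  ~0.15 bits over its floor

By that measure the "die" model actually comes out slightly ahead here, even though its raw perplexity (40 vs 4.5) looks far worse.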

As I understand the writing and the code, you do not compare against all valid tokens, but rather against the fixed ones. I'm just seeing the perplexities computed in the usual way (I see a loop over the batch, but not over valid permutations). I see this line in the text:

  > dataset shuffling during training.
So I assume this means the dataloader is shuffling the selected sentences? I don't see this in the code, but I'm happy to trust you if you say yes. The code makes me think the perturbations were generated beforehand (I'm having dependency issues so I can't verify). But if you are generating the perturbations beforehand, then I think the results are irrelevant, because you haven't been implicitly teaching the model that ordering doesn't matter. The fact that results get worse for the models without positional encoding is a red flag here: if position does not matter, why does positional information increase the model's ability to learn? It should be irrelevant to a non-deterministically shuffled language. I am also suspicious because the "no shuffle" model appears to have identical learning behavior in Figs 2 and 6. (I also see a lot of references to error bars, but it isn't clear to me what the variance is. Are the bars smaller than the markers? Rescaling could really help here, as would placing horizontal bars at the bounds, given how the markers render in the legends.)
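
For clarity, the distinction I'm drawing between the two training setups, in rough illustrative Python (hypothetical, not the paper's actual dataloader):

  import random

  # "Online" perturbation: each sentence is re-shuffled every time it is
  # served, so over training the model implicitly sees that order is noise.
  def online_item(tokens):
      perturbed = list(tokens)
      random.shuffle(perturbed)   # fresh permutation on every access/epoch
      return perturbed

  # "Offline" perturbation: each sentence gets one permutation drawn up front
  # and frozen, so the model is really learning a single fixed (if scrambled)
  # language rather than "order doesn't matter".
  def offline_dataset(corpus, seed=0):
      rng = random.Random(seed)
      fixed = []
      for tokens in corpus:
          perturbed = list(tokens)
          rng.shuffle(perturbed)  # drawn once, reused for every epoch
          fixed.append(perturbed)
      return fixed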

As for limitations, I also suspect there's a bias introduced by tokenization, since the tokenizer is learned from the expected ordering. I think this adds additional complexity that could be reduced, but not eliminated, by shuffling words instead of tokens. Not eliminated, because tokens depend not only on single words but on the sentences themselves: word pairs and sequences matter.
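
A toy contrast of the two shuffle granularities (my own example, pretending the sub-word split below is what a BPE-style tokenizer produced):

  import random

  sentence = "the bookshelf is full"
  subword  = ["the", "books", "he", "lf", "is", "full"]  # pretend BPE output

  token_shuffled = list(subword)
  random.shuffle(token_shuffled)   # pieces like "he" float free and can
                                   # collide with the pronoun "he"

  word_shuffled = sentence.split()
  random.shuffle(word_shuffled)    # each word's pieces stay together, but the
                                   # tokenizer was still fit on ordered text

  print(token_shuffled)
  print(word_shuffled)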

Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure from language, and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this; one simply needs sufficient sampling and for structure to exist. Clearly structure exists in language, so we should expect a sufficiently good pattern matcher to extract it.

◧◩◪◨⬒
5. canjob+F3g[view] [source] 2024-11-30 16:13:54
>>godels+Uxf
Thanks for the feedback! The point about perplexity is totally valid for the nondeterministic shuffle baseline. This seems to have misled a lot of people. But for all the other baselines, we're applying some one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the distribution being learned.

As for tokenization, good point: it's worth retokenizing based on the altered datasets to see if that changes anything. I think it might not, because the tokenizers we use are based on the frequency distribution of "words" identified by whitespace, but we'll have to check.

◧◩◪◨⬒⬓
6. godels+7rg[view] [source] 2024-11-30 20:17:03
>>canjob+F3g
Thanks for the reply!

  > one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the _distribution being learned_.
I disagree. Entropy of the model? The language? The sentence? The tokens? The distinction is subtle but important. A one-to-one mapping is not structure preserving. For a trivial example: {a,b,c} -> {c,a,b} doesn't preserve alphabetical ordering. The distribution the LM is learning is the intractable(?) distribution of the language itself. Certainly the entropy here changes, and I believe your results demonstrate this. I think the entropy would only be the same if we preserved all structure[0], but my understanding of the impossible languages is that they remove (all) structure. I'm not sure if that'd yield worthwhile results, but I think it could be a good sanity check -- or at least assumption check -- to do deterministic permutations based on syntactic structure. E.g. replace all S,V,O -> O,V,S for all sentences.
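
Something as simple as this, assuming you already have a constituent segmentation (the helper below is made up, not from the paper or its code):

  # Deterministic, structure-aware permutation: every clause gets the same
  # S,V,O -> O,V,S treatment, so syntactic regularity is preserved even
  # though the resulting word order is "impossible" for English.
  def svo_to_ovs(subject: str, verb: str, obj: str) -> str:
      return f"{obj} {verb} {subject}"

  print(svo_to_ovs("the linguists", "tested", "the model"))
  # -> "the model tested the linguists"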

  > because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
Maybe I was misled by Table 1, noting that "bookshelf" -> {books, he, lf}. That isn't whitespace-delimited. The examples only show "bookshelf" being preserved in the Shuffle (s=21) and HOP cases (well... split by the hop token), with partial reverse showing {books, [R], ., lf, he}. I think this is a good example of token bias, as our token "he" can hold multiple meanings, but I suspect the statistics change when we consider the word structure and how these tokens appear conditionally. It's also an example of where Figure 2 doesn't do a great job of proving the conclusion: the differences are small, so are they completely offset by the bias? HOP seems even more convincing, as the results converge and this bias should be much more easily accounted for. What's unclear to me here is why there's a significant difference at the beginning of training; I am a bit surprised that training from scratch would have this initialization bias.

(Also, just to note, I wouldn't reject this paper were it to come across my desk, though I do bias towards accepting works. I am picking on you because you won best paper and because of the narrative that formed around the paper.)

[0] I think we have "natural experiments" here w.r.t. learning other languages and translation, though not all structure is preserved: some is lost and some is gained. But this again can be affected by the tokenizer and by things like stripping accents. Clearly that removes structure, but it isn't going to have a big effect on English.

◧◩◪◨⬒⬓⬔
7. canjob+KPg[view] [source] 2024-12-01 01:00:47
>>godels+7rg
> A one-to-one mapping is not structure preserving.

That's just the point we were making: if you mess with the structure of natural language, then language models don't learn it as well any more.

The transformations do preserve the entropy in the sense that the lowest achievable perplexity is the same for everything except the nondeterministic shuffle, since applying a one-to-one function preserves the entropy of a discrete random variable (in this case, the random variable over documents in the training and test sets). In principle, a universal approximator should be able to learn to invert any one-to-one transformation we apply, although of course in practice a GPT architecture doesn't achieve this.
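
A quick numeric sanity check of what we mean (toy numbers, nothing to do with our actual data): relabeling the outcomes of a discrete random variable with any one-to-one map just moves the same probability masses onto new labels, so the entropy is unchanged.

  import math

  def entropy_bits(probs):
      return -sum(p * math.log2(p) for p in probs if p > 0)

  # Pretend distribution over four "documents".
  original = {"aab": 0.4, "aba": 0.3, "baa": 0.2, "abb": 0.1}

  # Any one-to-one relabeling (here: string reversal) preserves the masses.
  mapped = {doc[::-1]: p for doc, p in original.items()}

  print(entropy_bits(original.values()))  # ~1.846 bits
  print(entropy_bits(mapped.values()))    # identical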

The transformations do mess up the local entropy in strings, but I think that's part of the point. Human language seems to be structured so that things are locally predictable. When you screw up that structure, languages become harder to learn.
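
To be concrete about "local entropy" (again a toy, not our datasets): shuffling a string keeps its symbol counts, so the unigram floor is unchanged, but the conditional bigram entropy, i.e. local predictability, goes up.

  import math, random
  from collections import Counter

  def bigram_conditional_entropy(chars):
      # H(next char | current char), in bits.
      pairs = list(zip(chars, chars[1:]))
      context = Counter(c for c, _ in pairs)
      joint = Counter(pairs)
      h = 0.0
      for (c, _), count in joint.items():
          p_joint = count / len(pairs)
          p_cond = count / context[c]
          h -= p_joint * math.log2(p_cond)
      return h

  text = list("the cat sat on the mat " * 200)
  shuffled = list(text)
  random.Random(0).shuffle(shuffled)

  print(bigram_conditional_entropy(text))      # low: locally predictable
  print(bigram_conditional_entropy(shuffled))  # higher: same symbols, less local structure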

We are working on a follow-up with more transformations, including syntactic ones, as you might imagine. It's surprisingly hard to come up with manipulations that target specific aspects of linguistic structure while fitting the criteria that (1) they don't affect the lowest achievable perplexity, (2) they clearly change the language from "possible" to "impossible" in a way that all or most linguists would agree with, and (3) they can actually be implemented using the data we have (for example, a transformation that relies on detailed syntactic parses would require parsing the whole dataset, which is not only time-consuming but also introduces possible confounds from parser errors), etc. We're talking to a lot of people; if you have ideas, we'd be happy to hear them!

◧◩◪◨⬒⬓⬔⧯
8. foobar+wdi[view] [source] 2024-12-01 20:44:59
>>canjob+KPg
As I said in another comment, the only relevant synthetic languages that would refute Chomsky's claim are the ones we have human experiments for, specifically those of Moro.

I believe the relevant papers are referenced here on page 4. (Tettamanti et al., 2002; Musso et al., 2003; Moro, 2016)

https://acesin.letras.ufrj.br/wp-content/uploads/2024/02/Mor...
