> one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the _distribution being learned_.
I disagree. Entropy of the model? The language? The sentence? The tokens? The distinction is subtle but important. A one-to-one mapping is not structure preserving. For a trivial example: {a,b,c} -> {c,a,b} doesn't preserve alphabetical ordering. The distribution the LM is learning is that of the intractable(?) distribution of the language itself. Certainly the entropy here changes, and I believe your results demonstrate this.

I think the entropy would only be the same if we preserved all structure[0], but my understanding of the impossible languages is that they remove (all) structure. I'm not sure that would yield worthwhile results, but I think it could be a good sanity check -- or at least an assumption check -- to do deterministic permutations based on syntactic structure, e.g. replace S,V,O -> O,V,S for all sentences.

> because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
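To make the {a,b,c} -> {c,a,b} point above concrete, here is a minimal sketch (my own toy example, not from the paper): a mapping can be perfectly invertible while still destroying a structural property like ordering.

```python
# A bijection need not preserve structure. {a, b, c} -> {c, a, b} is
# invertible, yet it breaks the alphabetical ordering of a sequence.
perm = {"a": "c", "b": "a", "c": "b"}   # one-to-one by construction
inv = {v: k for k, v in perm.items()}   # its inverse

seq = ["a", "b", "c"]                   # alphabetically sorted
mapped = [perm[t] for t in seq]         # -> ["c", "a", "b"]

print(mapped == sorted(mapped))         # False: ordering not preserved
print([inv[t] for t in mapped] == seq)  # True: still fully invertible
</imports>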
Maybe I was misled by Table 1. Noting that "bookshelf" -> {books,he,lf}. That isn't whitespace delimited. The examples only show "bookshelf" being preserved in the Shuffle (s=21) and the HOP cases (well... split by the hop token), with Partial reverse showing {books, [R], ., lf, he}. I think this is a good example of token bias, as the token "he" can hold multiple meanings, but I suspect the statistics change when we consider word structure and how these tokens will appear conditionally.

This is also where I think Figure 2 doesn't do a great job of proving the conclusion. The differences are small, so are they completely offset by the bias? HOP seems even more convincing as the results converge, and this bias should be much more easily accounted for. What is unclear to me is why there's a significant difference at the beginning of training; I am a bit surprised that training from scratch would have this initialization bias.

(Also, just to note, I wouldn't reject this paper were it to come across my desk, but I bias towards accepting works. I am picking on you because you won best paper and because of the narrative that formed around the paper.)
[0] I think we have "natural experiments" here w.r.t. learning other languages and translation. Though not all structure is preserved: some is lost and some is gained. But this again can be affected by the tokenizer and whether you are doing things like stripping accents. Clearly that removes structure, but it isn't going to have a big effect on English.
That's just the point we were making: if you mess with the structure of natural language, then language models don't learn it as well any more.
The transformations do preserve the entropy in the sense that the lowest achievable perplexity is the same for everything except the nondeterministic shuffle: applying a one-to-one function preserves the entropy of a discrete random variable, in this case the random variable over documents in the training and test sets. In principle, a universal approximator should be able to learn to invert any one-to-one transformation we apply, although of course in practice a GPT architecture doesn't achieve this.
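A quick numeric check of the entropy-preservation claim (a toy sketch of mine, with made-up documents): a one-to-one function relabels outcomes without merging any of them, so the multiset of probabilities, and hence the entropy, is unchanged.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (bits) of a list of outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

docs = ["the cat sat", "the dog ran", "the cat sat", "a bird flew"]

# Any one-to-one transformation of whole documents (here: reversing
# word order) relabels outcomes without merging them, so H is the same.
transformed = [" ".join(reversed(d.split())) for d in docs]

print(entropy(docs) == entropy(transformed))  # True
```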
The transformations do mess up the local entropy in strings. But I think that's part of the point. Human language seems to be structured in a way so that things are locally predictable. When you screw up that structure, then languages are harder to learn.
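The local-entropy point can be illustrated with a toy corpus (again my own sketch, not from the paper): shuffling tokens leaves the unigram distribution identical, but conditional bigram entropy, a rough stand-in for how predictable the next token is locally, goes up.

```python
import math
import random
from collections import Counter

def bigram_cond_entropy(tokens):
    """Estimated H(next | current) in bits, from bigram counts."""
    big = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum((c / n) * math.log2(c / uni[a]) for (a, b), c in big.items())

rng = random.Random(0)
text = ("the cat sat on the mat . " * 200).split()
shuffled = text[:]
rng.shuffle(shuffled)

# Same token multiset (so same unigram entropy), but local structure
# is destroyed: the next token becomes much harder to predict.
print(sorted(text) == sorted(shuffled))                        # True
print(bigram_cond_entropy(text) < bigram_cond_entropy(shuffled))  # True
```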
We are working on a followup with more transformations, including syntactic ones, as you might imagine. It's surprisingly hard to come up with manipulations that target specific aspects of linguistic structure while fitting the criteria that (1) they don't affect the lowest achievable perplexity, (2) they clearly change the language from "possible" to "impossible" in a way that all or most linguists would agree with, and (3) they can actually be implemented using the data we have---for example, a transformation that relies on detailed syntactic parses would require that we parse the whole dataset, which is not only time-consuming but also introduces possible confounds from errors in the parser, etc. We're talking to a lot of people; if you have ideas, we'd be happy to hear them!
I believe the relevant papers are referenced here on page 4. (Tettamanti et al., 2002; Musso et al., 2003; Moro, 2016)
https://acesin.letras.ufrj.br/wp-content/uploads/2024/02/Mor...
In the intro, when you reference Chomsky, it says
| [Chomsky] make very broad claims to the effect that large language models (LLMs) are equally capable of learning possible and impossible human languages.
My objection here is to measuring the success of learning a language, not to the difficulty of the learning process. What I'm trying to say is that in the shuffled languages, when we are doing next-token prediction, the perplexity is necessarily higher. This must be true because of the destruction of local structure. BUT this does not mean that the language wasn't learned successfully. For example, in the next-token setting, if we are predicting the sentence "My name is godelski", then certainly the perplexity is higher when "Name godelski my is", "godelski name is my", etc. are also valid sequences. The perplexity is higher BUT the language is successfully learned.
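To put a rough number on that intuition (my own back-of-envelope sketch, not from the paper): if k orderings of an n-token sentence are all equally valid and the best possible model spreads its mass uniformly over them, the optimal log-loss rises by log2(k)/n bits per token, i.e. perplexity is multiplied by k^(1/n), even though the set of valid sentences is modeled perfectly.

```python
import math

# "My name is godelski" has n = 4 tokens; suppose all 4! = 24 orderings
# are valid (an idealized assumption for illustration).
n = 4
k = 24

extra_bits_per_token = math.log2(k) / n  # added log-loss per token
ppl_factor = k ** (1 / n)                # multiplicative perplexity increase

print(round(extra_bits_per_token, 3), round(ppl_factor, 3))  # 1.146 2.213
```

So the floor on perplexity more than doubles here without any failure to learn the language, which is exactly why perplexity alone can't settle the question.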
My point is that we have to be careful about how we define what it means for the language to be successfully learned.
(I'm not sure there is a tractable measure for the success of learning a language; I don't know of a proof in either direction. But I do know that perplexity is a proxy, and we must be careful about how well any measure aligns with what it stands in for, as there is always a difference. Even seemingly trivial things cannot be measured directly; measurement is always indirect. E.g. we measure distance with a ruler, which is an approximation based on the definition of a meter but is not a meter itself, though there the alignment is typically very good.)
> a universal approximator should be able to learn to invert any one-to-one transformation we apply
I agree, but I think we need to be a bit careful here. Extending the universal approximation theorem to discrete distributions introduces new conditions; the usual form requires that the function being approximated be continuous on a closed, bounded domain. But I think we also need to be careful with how we look at complexity. Yes, a bijective function is going to have the same complexity in both directions, but this will not hold if there is any filtering. The part where we really have to be careful is the difference in difficulty of _learning_ D and _learning_ T(D). Even if T is simple, these learning processes may not be equally simple. It's subtle but I believe important. As a clear example, we often scale data (one might call this normalization, but let's not be ambiguous, and let's make sure it is bijective), and it is clear that learning the scaled data is simpler than learning the unscaled data. So while yes, a universal approximator is __capable__ of learning to invert any bijection, this does not mean that __learning__ to invert it is easy.

I do really appreciate the chat and the explanations. I'm glad to know there is a followup, and I'm interested to see the results.
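A concrete toy version of the scaling example (entirely my own sketch; the scale s, the step size, and the quadratic objective are all arbitrary choices): gradient descent on f(w) = ((w1 - 1)^2 + s^2 (w2 - 1)^2) / 2. Rescaling the second coordinate by s is bijective, but it changes the conditioning of the problem, and with it how many steps the same algorithm needs.

```python
def steps_to_converge(s, tol=1e-6):
    """Run gradient descent on f(w) = ((w1-1)^2 + s^2*(w2-1)^2)/2
    from the origin and count steps until both coordinates converge."""
    lr = 1.0 / s**2  # largest "safe" step size shrinks as conditioning worsens
    w1, w2 = 0.0, 0.0
    for t in range(1, 100_000):
        w1 -= lr * (w1 - 1.0)            # gradient in w1: (w1 - 1)
        w2 -= lr * s * s * (w2 - 1.0)    # gradient in w2: s^2 * (w2 - 1)
        if abs(w1 - 1.0) < tol and abs(w2 - 1.0) < tol:
            return t
    return None

print(steps_to_converge(1))    # well-conditioned: converges in one step
print(steps_to_converge(10))   # same minimum, related by a bijection, far slower
```

Same minimizer, bijectively related problems, wildly different learning effort -- which is the point about D versus T(D).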