godels+ (OP) 2024-12-02 00:48:23
I was going to write more, but I wanted to keep this comment simple. I think we agree more than we disagree and that I've not communicated effectively. So I want to focus on my main point about the metric (and one other part).

In the intro, where you reference Chomsky, it says

  | [Chomsky] makes very broad claims to the effect that large language models (LLMs) are equally capable of learning possible and impossible human languages.
My objection here is to how we measure the success of learning a language, not to the difficulty of the learning process.

What I'm trying to say is that in the shuffled languages, when we are doing next-token prediction, the perplexity is necessarily higher. This must be true because of the destruction of local structure. BUT this does not mean that the language wasn't learned successfully. For example, in the next-token setting, if we are predicting the sentence "My name is godelski", then the perplexity is certainly higher when "Name godelski my is", "godelski name is my", etc. are also valid sequences. The perplexity is higher, BUT the language is successfully learned. A toy sketch of my own (not from the paper) to make this concrete: an oracle next-token predictor that has perfectly learned a 4-token language still scores a higher perplexity when every word order is valid:
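
    import math
    from itertools import permutations

    # Toy "language": sentences over the 4 tokens of "my name is godelski".
    tokens = ("my", "name", "is", "godelski")

    def per_token_perplexity(sentences):
        # Perplexity of an *oracle* next-token predictor that knows the
        # language perfectly: at each position it spreads probability
        # uniformly over the continuations that are actually valid.
        log_prob, count = 0.0, 0
        for sent in sentences:
            for i in range(len(sent)):
                nexts = {s[i] for s in sentences if s[:i] == sent[:i]}
                log_prob += math.log(1.0 / len(nexts))
                count += 1
        return math.exp(-log_prob / count)

    ordered = [tokens]                      # one canonical word order
    shuffled = list(permutations(tokens))   # every word order is valid

    print(per_token_perplexity(ordered))    # 1.0, fully deterministic
    print(per_token_perplexity(shuffled))   # ~2.21, i.e. 24^(1/4)

The second model has learned its language exactly as well as the first, yet its perplexity is irreducibly higher: the gap reflects the entropy of the language itself, not a failure of learning.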

My point is that we have to be careful about how we define what it means for the language to be successfully learned.

(I'm not sure there is a tractable measure for the success of learning a language; I don't know of a proof in either direction. But I do know that perplexity is a proxy, and we must be careful about how well any measure aligns with the thing we actually care about, as there is always some gap. Even in seemingly trivial settings we cannot measure anything directly; measurement is always indirect. E.g., we measure distance with a ruler, which is an approximation based on the definition of a meter but is not a meter itself, though in that case the alignment is typically very good.)

  > a universal approximator should be able to learn to invert any one-to-one transformation we apply
I agree, but I think we need to be a bit careful here. Extending the universal approximation theorem to discrete distributions introduces new conditions; the usual form requires the function we're approximating to be continuous on a compact (closed and bounded) domain. I also think we need to be careful with how we look at complexity. Yes, a bijective function is going to have the same complexity in both directions, but this will not hold if there is any filtering. The part we really have to be careful about, though, is the difference in difficulty of _learning_ D and _learning_ T(D). Even if T is simple, these learning processes may not be equally simple. It's subtle but, I believe, important. As a clear example, we often scale data (one might call this normalization, but to avoid ambiguity, let's make sure the map is bijective), and learning from the scaled data is clearly easier than learning from the unscaled data. So while a universal approximator is __capable__ of learning to invert any bijection, this does not mean that __learning__ to invert it is easy. A quick sketch of my own (not from the paper): fit the same affine model, with the same optimizer and the same step budget, on raw features D and on standardized features T(D), where T is a simple bijection:
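
    import numpy as np

    rng = np.random.default_rng(0)

    # D: a toy regression task whose feature lives on a large scale.
    x_raw = rng.uniform(0, 1000, size=256)
    y = 3.0 * x_raw + 1.0

    # T: a simple bijection -- standardize the feature.
    x_scaled = (x_raw - x_raw.mean()) / x_raw.std()

    def fit(x, y, lr, steps=500):
        # Plain gradient descent on MSE for the model w*x + b.
        w, b = 0.0, 0.0
        for _ in range(steps):
            err = (w * x + b) - y
            w -= lr * 2.0 * np.mean(err * x)
            b -= lr * 2.0 * np.mean(err)
        return float(np.mean(((w * x + b) - y) ** 2))

    # Same model family, same optimizer, same step budget.
    # On T(D) the problem is well conditioned and lr=0.1 solves it to
    # machine precision; on D anything much above ~3e-6 diverges, and
    # at a stable lr the bias term barely moves within the budget.
    print("MSE on T(D):", fit(x_scaled, y, lr=0.1))
    print("MSE on D:   ", fit(x_raw, y, lr=1e-7))

Both problems are related by an invertible affine map, and the same approximator can represent the solution to either, but the conditioning of the optimization landscape makes one far easier to learn than the other.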

I do really appreciate the chat and the explanations. I'm glad to know there is a follow-up, and I'm interested to see the results.
