There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
I.e., there is a lot of commonality between programming languages, just as there is between human languages, so training on one language would benefit competency in the others.
I assumed that is what "even when controlling for the size of the training set" was meant to cover.
I.e., assuming I am reading it right: for the same total amount of data, it is better to have 25% in each of 4 languages than 100% in one language.
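To make that arithmetic concrete, here is a minimal sketch, assuming a made-up fixed budget of 1,000,000 training tokens and an arbitrary list of four languages (neither comes from any actual study; `split_budget` is just a hypothetical helper):

```python
# Hypothetical numbers for illustration only; not from any real experiment.
TOTAL_TOKENS = 1_000_000  # assumed fixed training budget


def split_budget(total_tokens, languages):
    """Split a fixed token budget evenly across the given languages."""
    share = total_tokens // len(languages)
    return {lang: share for lang in languages}


monolingual = split_budget(TOTAL_TOKENS, ["English"])
multilingual = split_budget(
    TOTAL_TOKENS, ["English", "German", "Spanish", "Japanese"]
)

# Both setups see the same total amount of data...
assert sum(monolingual.values()) == sum(multilingual.values()) == TOTAL_TOKENS

# ...but the multilingual one sees only a quarter as much English.
print(monolingual["English"], multilingual["English"])  # 1000000 250000
```

If I am reading the claim right, the surprising part is that the second setup still ends up writing better English, despite seeing only a quarter as much of it.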
To do well as an LLM you want to end up with the weights that get furthest in the direction of "reasoning".
So assume that with just one language it is possible to get stuck in local optima: weights that do well on the English test set but don't reason well.
If you then take the same model size but require it to learn several languages with the same number of weights, that would eliminate a lot of those local optima. If you don't get the weights into a regime where real reasoning and deeper concepts are "understood", it's not possible to do well on several languages with that number of weights.
And if you speak several languages, that would naturally bring in more abstraction: for example, that the concept of "cat" is different from the word "cat" in a given language, and so on.