zlacker

1. Davidz+(OP) 2023-11-22 18:41:56
This is incorrect. For example, the ability to translate between languages is emergent. Also, GPT-4 can do arithmetic better than the average person, especially considering that it arrives at the computation via intuition, basically, rather than algorithmically. Btw, just as an aside, the newer models can also write code to do certain tasks, like arithmetic.
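
Roughly the pattern, as a sketch (ask_model is a made-up stand-in for whatever completion call you use; the point is the division of labour: the model writes the program, the interpreter does the arithmetic):

    # Illustrative sketch of "write code to do arithmetic" tool use.
    # ask_model is hypothetical, not a real API.
    def ask_model(prompt: str) -> str:
        # A model prompted with "Write Python that prints 123456 * 789"
        # might return something like:
        return "print(123456 * 789)"

    generated = ask_model("Write Python that prints 123456 * 789")
    exec(generated)  # the arithmetic is now exact, done by the interpreter
    # -> 97406784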
replies(2): >>sgt101+u82 >>james-+b45
2. sgt101+u82 2023-11-23 10:55:53
>>Davidz+(OP)
Language translation is due to the huge corpus of translations it's trained on; Google Translate has been doing this for years. People don't apply softmax to their arithmetic. Again, code generation is approximate retrieval: it can't generate anything outside of its training distribution.
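
To make that concrete, a toy sketch (logit values invented for illustration) of how an LM "does" 7 * 8: a softmax over next-token logits, not a carry algorithm:

    import numpy as np

    # Invented logits for the next token after "7 * 8 = ",
    # restricted to the digit tokens "0".."9" for illustration.
    logits = np.array([0.1, 0.3, 0.2, 0.4, 0.2, 6.1, 1.0, 0.3, 0.2, 0.1])

    def softmax(z):
        z = z - z.max()        # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    probs = softmax(logits)
    print({str(d): round(float(p), 3) for d, p in enumerate(probs)})
    # "5" gets most of the mass, so "56" is likely, but nothing here
    # multiplies anything: it's learned token statistics.
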
3. james-+b45 2023-11-24 11:13:09
>>Davidz+(OP)
Not necessarily. Much smaller models like T5, which in some ways introduced instructions (though not RLHF yet), did have to include specific instructions for useful translation, of a similar format to those you find in large-scale web translation data. But this is coincidental: you can fine-tune it with whatever instruction word you want to indicate translation. The point is, a much smaller model can translate.
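
For example, with the public t5-small checkpoint the instruction is literally just a text prefix the model was trained with:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task instruction is plain text prepended to the input.
    inputs = tokenizer("translate English to German: The house is wonderful.",
                       return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    # e.g. "Das Haus ist wunderbar."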

The base non-RLHF GPT models could do translation by prefixing with the target language and a semicolon, but they're only consistent above a certain number of parameters. GPT-2 didn't always get it right and of course had general issues with continuity. However, you could always do some parts of translation with older transformer models like BERT, especially multilingual ones.
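
A sketch of the prefix trick via the transformers pipeline (I'm using a colon as the separator here; base GPT-2 is hit-or-miss on this, which is the point):

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2-large")

    # Few-shot "target language plus separator" prompt; no fine-tuning involved.
    prompt = ("English: The weather is nice today.\n"
              "French: Il fait beau aujourd'hui.\n"
              "English: Where is the train station?\n"
              "French:")
    out = generator(prompt, max_new_tokens=15, do_sample=False)
    print(out[0]["generated_text"][len(prompt):])
    # Base GPT-2 often continues plausibly but inconsistently.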

Larger models across different from-scratch training runs show that they become more effective at translation at certain points, but I think this is about the capacity to store information, not emergence per se (if you follow the distinction I'm making). You've probably noticed it too, but it has always seemed to me that, with 2020-style training setups, roughly 4B, 6B and 9B parameters are the sizes where you see the most general appearance of useful behaviours that can be gleaned from web and book data containing no instructions, while consistency seems to remain the domain of larger models, or of mixture-of-experts models plus lots of RLHF training and tricks. The easiest way to see this is to compare GPT-2 large, GPT-J and GPT-20B and see how well they perform at different tasks.

However, the fact that it's about size in these GPTs, and yet much smaller models (instruction-tuned T5, multilingual BERT) can perform at the same level on some tasks, implies that it is about what the model focuses its learning on for the training task at hand, and is controllable, rather than innate at a certain parameter size. Translations simply make up a lot of the data. I don't think translation would emerge if you removed all cases of translation and multi-language input/output, and definitely not at the same parameter size, even if you kept the same overall proportion of languages in the training corpus, if that makes sense? It just seems too much an artefact of the corpus aligning with the task.
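
If you want to try that comparison yourself, a rough sketch (assuming "GPT-20B" means EleutherAI's GPT-NeoX-20B; the bigger checkpoints need serious GPU memory, so this is a sketch, not a benchmark):

    from transformers import pipeline

    # Same prompt across checkpoints of increasing size.
    models = ["gpt2-large", "EleutherAI/gpt-j-6B", "EleutherAI/gpt-neox-20b"]
    prompt = "English: Good morning.\nFrench:"

    for name in models:
        gen = pipeline("text-generation", model=name)
        out = gen(prompt, max_new_tokens=10, do_sample=False)
        print(f"{name}: {out[0]['generated_text'][len(prompt):].strip()}")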

Likewise for code: GPT-4 doesn't generate code the way people might mean it for code (e.g. branching instructions, an abstract syntax tree); like its arithmetic, it's a fundamentally local, textual form of generation. This is why it can happily add illegal imports etc. to diffs (perhaps one day training will resolve this). It doesn't have the AST, or a compiler, or much consistent behaviour to imply that it deeply understands, as it writes the code, what could occur.
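
To illustrate the kind of global check the generation step never performs: a few lines with Python's ast module that catch the illegal-import failure mode after the fact (the whitelist is hypothetical):

    import ast

    ALLOWED = {"math", "json"}  # hypothetical per-project whitelist

    def illegal_imports(source: str) -> list[str]:
        """Return imported top-level modules not on the whitelist."""
        bad = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                bad += [a.name for a in node.names
                        if a.name.split(".")[0] not in ALLOWED]
            elif isinstance(node, ast.ImportFrom) and node.module:
                if node.module.split(".")[0] not in ALLOWED:
                    bad.append(node.module)
        return bad

    print(illegal_imports("import os\nimport math\nfrom requests import get"))
    # -> ['os', 'requests']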

However, if recent reports about arithmetic being an area of improvement are true, I am very excited, as a lot of what I wrote above will have to be reconceptualised... and that is the most exciting scenario...
