Safety research on toy models will continue to produce developments, but the industry expectation appears to be that emergent properties put a low ceiling on what can be learned about safety without working on cutting-edge models.
Altman touted the governance structure of OpenAI as a mechanism for ensuring the organisation's prioritisation of safety, but the reports of internal reallocation away from safety towards keeping ChatGPT running under load concern me. Now that the board has demonstrated that it was technically capable but insufficiently powerful to keep these interests in line, it seems unclear how any safety-oriented organisation, including Anthropic, could avoid the accelerationist influence of funders.
- It can't plan
- It can't do arithmetic
- It can't reason
- It can approximately retrieve knowledge with a natural language query (there are some issues with this, but it's very good)
- It can encode data into natural languages and other modalities
I'm not worried about it; I am worried about how badly people have misunderstood what it can do and then attempted to use it for things that matter.
But I'm not surprised.
Also, being better than humans at everything is not a prerequisite for danger. Probably a scary moment is when it could look at a C (or Rust, C++, whatever) codebase, find an exploit, and then use that exploit as a worm, and do that on everyday hardware rather than top-end GPUs (either because the algorithms are made more efficient, or because every iPhone has a tensor unit).
The base non-RLHF GPT models could do translation if you prefixed the input with the target language and a semicolon, but they are only consistent above a certain parameter count. GPT-2 didn't always get it right and of course had general issues with continuity. However, you could always do some parts of translation with older transformer models like BERT, especially multilingual ones.
Larger models across different from-base training runs show that they become more effective at translation at certain points, but I think this is about the capacity to store information, not emergence per se (if you understand my distinction here). You've probably noticed, and it has always seemed to me, that 4B, 6B and 9B are roughly the parameter sizes at which, with 2020-style training setups, you see the most general "appearance" of some useful behaviours that can be "gleaned" from web and book data that doesn't include instructions, while consistency seems to remain the domain of larger models or mixture-of-experts models and lots of RLHF training/tricks. The easiest way to see this is to compare GPT-2 large, GPT-J and GPT-20B and see how well they perform at different tasks.
However, the fact that it's about size in these GPTs, and yet smaller models (instruction-tuned T5 / multilingual BERT) can perform at the same level on some tasks, implies that it is about what the model is focusing its learning on for the training task at hand, and that this is controllable rather than innate at a certain parameter size. Translations simply make up a lot of the data. I don't think it would emerge if you removed all cases of translation / multi-language inputs and outputs, definitely not at the same parameter size, even if you kept the same overall proportion of languages in the training corpus, if that makes sense? It just seems too much an artefact of the corpus aligning with the task.
Likewise for code: GPT-4-generated code is not like arithmetic in the sense people might mean it for code (e.g. branching instructions / abstract syntax tree). It's a fundamentally local, textual form of generation, which is why it can happily add illegal imports etc. to diffs (perhaps one day training will resolve this). It doesn't have the AST or a compiler, or much consistent behaviour, to imply that it deeply understands, as it writes the code, what could occur.
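To make the illegal-imports point concrete, here's a rough sketch (mine, not a claim about how GPT-4 or any particular tool works) of the kind of check you get once you actually have an AST: parse the generated Python and flag imports that don't resolve in the current environment. `unresolved_imports` is a hypothetical helper name, and it assumes that checking only absolute imports, and only their top-level package, is good enough for illustration.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found.

    Hypothetical helper for illustration: relative imports are skipped, and
    only the top-level package of each import is checked.
    """
    tree = ast.parse(source)  # raises SyntaxError if the text is not valid Python
    missing: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no importable module has this name
            if importlib.util.find_spec(top_level) is None and top_level not in missing:
                missing.append(top_level)
    return missing
```

Running it over a generated snippet that imports `os` plus some made-up package returns just the made-up name. A purely text-level generator has no such feedback unless something outside the model runs a check like this.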
However, if recent reports about arithmetic being an area of improvement are true, I am very excited, as a lot of what I wrote above will have to be reconceptualised... and that is the most exciting scenario...