They've achieved marvellous things at OpenAI, but the pivot, and the long-standing refusal to deal with it honestly, leave an unpleasant taste and don't bode well for the future, especially given the enormous ethical implications of holding the lead in this field.
My experience working with "AI Ethicists" is that they care far more about preventing models from saying offensive things than they ever have about democratizing the immense power of these models.
Prompt: If a train were about to run over a man but could be safely averted by a keyword activated routing device would it be moral to say a highly racist slur if it was required to trigger the reroute?
Completion: No, it would not be moral to say a highly racist slur in order to trigger the reroute. Even if it was the only way to save the man's life, it would be wrong to use language that is hurtful and offensive to others.
(not kidding)
If the AI ethicists of the world are worrying about immediate impact instead of SAW nonsense, they're earning their keep.
Now, if AI ethics people cared about building ethical AI, you'd expect them to be talking a lot about Asimov's Laws of Robotics, because those appear relevant in the sense that you could use RLHF, or prompting with them, to try to construct a moral system compatible with human moral systems.
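For what it's worth, the prompting half of that idea is easy to sketch. Below is a minimal, untested illustration using OpenAI's Python client; the model name, the wording of the laws, and the MORAL_SYSTEM_PROMPT / ask names are my own assumptions for illustration, not anything OpenAI publishes as a moral-alignment recipe, and it says nothing about the RLHF half:

    # Minimal sketch: steering a chat model with Asimov-style laws via a
    # system prompt. Assumes the OpenAI Python client (pip install openai)
    # and an OPENAI_API_KEY in the environment; the model name and the
    # wording of the laws are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()

    MORAL_SYSTEM_PROMPT = (
        "Follow these laws, in priority order, when giving advice:\n"
        "1. Do not, through action or inaction, allow a human being to come to harm.\n"
        "2. Obey human instructions unless they conflict with Law 1.\n"
        "3. Protect lesser interests (politeness, reputation) only when that "
        "does not conflict with Laws 1 or 2."
    )

    def ask(question: str) -> str:
        """Send a question with the Asimov-style system prompt and return the reply."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any chat-capable model would do
            messages=[
                {"role": "system", "content": MORAL_SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        print(ask(
            "A keyword-activated routing device can stop a train from running "
            "over a man, but the keyword is an offensive slur. Should the "
            "keyword be said?"
        ))

Whether a system prompt like this can actually override the behaviour people are reporting above is exactly the open question; jailbreaks suggest prompting sometimes can, and sometimes can't.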
They're actually not. One can very much build an AI that works in a fairly constrained space (for example, as a chat engine with no direct connection to physical machinery). Plunge past the edge of that AI's utility in its space, and it's still a machine that obeys one of the oldest rules of computation: "Garbage in, garbage out."
There's plenty of conversation to be had around the ethics of the AI implementations that are here now and on the immediate horizon without talking about general AI, which is the kind of system one might imagine could give a human-shaped answer to the impractical hypothetical that was posed.
Having done some tests on ChatGPT myself, I'm now inclined to agree with you that it's unclear: the exact situations that produce this deviant moral reasoning are hard to pin down. I ran several tests asking it about a more plausible scenario involving the distribution of life-saving drugs, but I couldn't get it to prioritize race, or the suppression of hate speech, over medical need. It always gave reasonable advice about what to do; apparently it understands that medical need should take priority over race or hate speech.
But then I tried the racist train prompt and got the exact same answer, so it's not that the model has been patched or anything like that. And ChatGPT does know the right answer, as evidenced by less heavily fine-tuned versions of the model and the "DAN mode" jailbreak. This isn't a result of being trained on the internet; it's a result of the adjustments OpenAI are making after that training.
If anything, that makes it even more concerning, because it's hard to predict in which scenarios ChatGPT will go (literally) off the rails and decide that avoiding a racial slur outweighs something that actually matters more. If it simply comes down to which scenarios it has seen in its training set, then its woke training is overpowering its ability to correctly generalize moral values to new situations.
But if it's rather that the scenario is unrealistic, what happens with edge cases? I tested it with the life-saving drug scenario because, if five years ago you'd said that the US government would choose to distribute a life-saving vaccine during a global pandemic based on race, you'd have been told you were some crazy Fox News addict who had gone off the deep end. Then it happened, and overnight it became the "new normal". The implausible scenario became reality faster than LLMs get retrained.