I see people saying that these kinds of things are happening behind closed doors, but I haven't seen any convincing evidence of it, and there is an enormous propensity for AI speculation to run rampant.
As others have pointed out in other threads, RLHF has progressed beyond next-token prediction, and modern models are modeling concepts [1].
[0] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
[1] https://www.anthropic.com/news/tracing-thoughts-language-mod...