One thing these trained voices make clear is that it's a TTS engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model is natively consuming and generating speech appears to be bunk.
This doesn't make any sense. If it's a speech-to-speech transformer, then "training" a voice could just be a reference sample placed at the beginning of the context window. Or it could be one of several voices used during the instruction-tuning or RLHF process. Either way, it doesn't debunk anything.
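To make the context-window point concrete, here's a toy sketch of how a model operating on audio tokens could take on a "trained" voice purely by conditioning on a reference sample, with no per-voice training run. Every name here (AudioTokenizer, SpeechLM, respond_in_voice) is made up for illustration and is not anything OpenAI has described:

```python
# Hypothetical sketch: a speech-to-speech model picking up a voice from its
# context window. All classes are illustrative stand-ins, not a real API.

import numpy as np


class AudioTokenizer:
    """Toy stand-in for a neural audio codec mapping waveforms to discrete tokens."""

    def encode(self, waveform: np.ndarray) -> list[int]:
        # Real codecs emit codebook indices; here we just quantize amplitudes
        # into fake token IDs to keep the example runnable.
        return [int(x * 127) % 1024 for x in waveform[:100]]


class SpeechLM:
    """Toy autoregressive model over audio tokens."""

    def generate(self, context: list[int], n_tokens: int) -> list[int]:
        # A real model would sample from p(next token | context); the
        # reference-voice tokens at the front of the context would steer the
        # timbre of the output. Here we just echo placeholder tokens.
        return [(t + 1) % 1024 for t in context[-n_tokens:]]


def respond_in_voice(reference_voice: np.ndarray, user_speech: np.ndarray) -> list[int]:
    tok = AudioTokenizer()
    model = SpeechLM()
    # "Training" the voice is just conditioning: the reference sample sits at
    # the start of the context, and generation continues in that voice.
    context = tok.encode(reference_voice) + tok.encode(user_speech)
    return model.generate(context, n_tokens=50)
```

The point is only that a fixed menu of voices is equally consistent with prompt-style conditioning or with voices baked in during post-training, so it isn't evidence for a separate TTS stage.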