One thing these trained voices make clear is that a TTS engine is generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model natively consumes and generates speech appears to be bunk.
It was quickly apparent that text alone is a poor medium for the variety and scope of signals that these multimodal networks could communicate.