zlacker

> One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before.

I'm not familiar with the specifics of how AI models work, but don't the abilities shown in some of the demos rule out what you've said above? E.g., the speeding up and slowing down of speech and the sarcasm don't seem possible if TTS were a separate component.

replies(3): >>mmcwil+g3 >>HarHar+x8 >>nabaki+8j

>>monroe+(OP)
I have no special insight into what they're actually doing, but speeding up and slowing down speech have been features of SSML for a long time. If they are generating a similar markup language, it's not inconceivable that it could do what you're describing.
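
For anyone who hasn't seen SSML, here is a minimal sketch of the kind of markup I mean, built in Python. The `prosody` element and its `rate` attribute come from the SSML spec; the helper name `ssml_with_rates` and the sample sentences are made up for illustration, and nothing here is a claim about what OpenAI actually does.

```python
import xml.etree.ElementTree as ET

def ssml_with_rates(segments):
    """Build an SSML string from (text, rate) pairs.

    Hypothetical helper for illustration only. `rate` is passed straight
    through to <prosody rate="...">, e.g. "slow", "fast", or "150%".
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    for text, rate in segments:
        # Each segment gets its own <prosody> wrapper with its speaking rate.
        prosody = ET.SubElement(speak, "prosody", {"rate": rate})
        prosody.text = text
    return ET.tostring(speak, encoding="unicode")

markup = ssml_with_rates([
    ("Let me say this slowly.", "slow"),
    ("And this part quickly.", "fast"),
])
print(markup)
```

A text model that emitted markup like this could, in principle, steer a separate synthesizer's pacing without producing any audio itself.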

replies(1): >>Grille+Dc

>>monroe+(OP)
The older formant-based (vs. speech-sample-based) speech synthesizers like DECtalk could do this too. You could select one of a half dozen voices (some male, some female), but also select the speed, word pronunciation/intonation, get it to sing, etc., because these are all just parameters feeding into the synthesizer.

It would be interesting to hear the details, but what OpenAI seem to have done is build a neural-net-based speech synthesizer which is similarly flexible because it is generating the audio itself (not stitching together samples), conditioned on the voice ("Sky", etc.) it is meant to be mimicking. Dialing the emotion up/down is basically affecting the prosody and intonation. The singing is mostly extending vowel sounds and adding vibrato. In the demo Brockman refers to the "singing voice", so it's not clear whether they can make any of the 5 (now 4!) voices sing.

In any case, it seems the audio is being generated by some such flexible tts, not just decoded from audio tokens generated by the model (which anyway would imply there was something, basically a tts, converting text tokens to audio tokens). They also used the same 5 voices in the previous ChatGPT, which wasn't claiming to be omnimodal, so maybe it's basically the same tts being used.

>>mmcwil+g3
It's also possible that any such enunciation is being hallucinated from the text by the speech model.

AI models exist to make up bullshit that fills a gap. When you have a conversation with any LLM it's merely autocompleting the next few lines of what it thinks is a movie script.

>>monroe+(OP)
Azure Speech tts is capable of speeding up, slowing down, sarcasm, etc with SSML. I wouldn't be surprised if it's what OpenAI is using on the backend.

replies(1): >>vessen+zj

>>nabaki+8j
Greg has specifically said it's not an SSML-parsing text model; he's said it's an end-to-end multimodal model.

FWIW, I would find it very surprising if you could get the low-latency expressiveness, singing, harmonizing, sarcasm and interpretation of incoming voice through SSML -- that would be a couple of orders of magnitude better than any SSML product I've seen.

replies(1): >>nabaki+ex3

>>vessen+zj
Not sure about the low latency aspect, but I've seen everything else you mentioned done with SSML. Also, I can't find where Greg said that; could you point me to it?