zlacker

I found the whole ChatGPT-4o demo to be cringe inducing. The fact that Altman was explicitly, and desperately, trying to copy "her" at least makes it understandable why he didn't veto the bimbo persona - it's actually what he wanted. Great call by Scarlett Johansson in not wanting to be any part of it.

One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model is natively consuming and generating speech appears to be bunk.

replies(4): >>aabhay+f3 >>leumon+D4 >>monroe+U4 >>famous+m9

>>HarHar+(OP)
I wouldn’t go as far as your last statement. While shocking, it’s not inconceivable that there’s native token I/O for audio. In fact tokenizing audio directly actually seems more efficient since the tokenization could be local.

Nevertheless. This is still incredibly embarrassing for OpenAI. And totally hurts the company’s aspiration to be good for humanity.

replies(1): >>timeon+KZ

>>HarHar+(OP)
I think it is more then a simple tts engine. At least from the demo, they showed: It can control the speed and it can sing when requested. Maybe its still a seperate speech engine, but more closely connected to the llm.

replies(3): >>sooheo+y8 >>kromem+ol >>nabaki+Pn

>>HarHar+(OP)
> One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before.

I'm not familiar with the specifics of how AI models work but doesn't the ability from some of the demos rule out what you've said above? Eg. The speeding up and slowing down speech and the sarcasm don't seem possible if TTS was a separate component

replies(3): >>mmcwil+a8 >>HarHar+rd >>nabaki+2o

>>monroe+U4
I have no special insight into what they're actually doing, but speeding up and slowing down speech have been features of SSML for a long time. If they are generating a similar markup language it's not inconceivable that it would be possible to do what you're describing.

replies(1): >>Grille+xh

>>leumon+D4
tts with separate channels for style would do it, no?

>>HarHar+(OP)
>One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model is natively consuming and generating speech appears to be bunk.

This doesn't make any sense. If it's a speech to speech transformer then 'training' could just be a sample at the beginning of the context window. Or it could one of several voices used for the Instruct-tuning or RLHF process. Either way, it doesn't debunk anything.

>>monroe+U4
The older formant-based (vs speech sample based) speech sythesizers like DECTalk could do this too. You could select one of a half dozen voices (some male, some female), but also select the speed, word pronunciation/intonation, get it to sing, etc, because these are all just parameters feeding into the synthesizer.

It would be interesting to hear the details, but what OpenAI seem to have done is build a neural net based speech synthesizer which is similarly flexible because it it generating the audio itself (not stitching together samples) conditioned on the voice ("Sky", etc) it is meant to be mimicking. Dialing the emotion up/down is basically affecting the prosody and intonation. The singing is mostly extending vowel sounds and adding vibrato, but it'd be interesting to hear the details. In the demo Brockman refers to the "singing voice", so not clear if they can make any of the 5 (now 4!) voices sing.

In any case, it seems the audio is being generated by some such flexible tts, not just decoded from audio tokens generated by the model (which anyways would imply there was something - basically a tts - converting text tokens to audio tokens). They also used the same 5 voices in the previous ChatGPT which wasn't claiming to be omnimodal, so maybe basically the same tts being used.

>>mmcwil+a8
It's also possible that any such enunciation is being hallucinated from the text by the speech model.

AI models exist to make up bullshit that fills a gap. When you have a conversation with any LLM it's merely autocompleting the next few lines of what it thinks is a movie script.

>>leumon+D4
Most impressive was the incredulity to the 'okay' during the counting demo after the nth interruption.

Was quickly apparent that text only is a poor medium for the variety and scope of signals that could be communicated by these multimodal networks.

>>leumon+D4
Azure Speech tts is capable of doing this with SSML. I wouldn't be surprised if it's what OpenAI is using on the backend.

>>monroe+U4
Azure Speech tts is capable of speeding up, slowing down, sarcasm, etc with SSML. I wouldn't be surprised if it's what OpenAI is using on the backend.

replies(1): >>vessen+to

>>nabaki+2o
Greg has specifically said it's not an SSML-parsing text model; he's said it's an end to end multimodal model.

FWIW, I would find it very surprising if you could get the low latency expressiveness, singing, harmonizing, sarcasm and interpretation of incoming voice through SSML -- that would be a couple orders of magnitude better than any SSML product I've seen.

replies(1): >>nabaki+8C3

>>aabhay+f3
> company’s aspiration to be good for humanity

Seems like they abandoned it pretty early - if it was real in the first place.

>>vessen+to
Not sure about the low latency aspect, but I've seen everything else you mentioned with SSML. Also, I can't find where Greg said that, could you point me to it?