zlacker

[return to "Statement from Scarlett Johansson on the OpenAI "Sky" voice"]
1. HarHar+mg[view] [source] 2024-05-21 00:01:25
>>mjcl+(OP)
I found the whole ChatGPT-4o demo to be cringe-inducing. The fact that Altman was explicitly, and desperately, trying to copy "her" at least makes it understandable why he didn't veto the bimbo persona - it's actually what he wanted. Great call by Scarlett Johansson in not wanting to be any part of it.

One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model is natively consuming and generating speech appears to be bunk.

2. monroe+gl[view] [source] 2024-05-21 00:33:23
>>HarHar+mg
> One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before.

I'm not familiar with the specifics of how AI models work, but doesn't the ability shown in some of the demos rule out what you've said above? E.g. the speeding up and slowing down of speech, and the sarcasm, don't seem possible if TTS were a separate component.

3. HarHar+Nt[view] [source] 2024-05-21 01:36:47
>>monroe+gl
The older formant-based (vs. speech-sample-based) speech synthesizers like DECTalk could do this too. You could select one of a half dozen voices (some male, some female), but also select the speed, word pronunciation/intonation, get it to sing, etc., because these are all just parameters feeding into the synthesizer.
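For anyone unfamiliar with formant synthesis, here's a minimal sketch of the idea: a pulse train at the pitch frequency is filtered through one resonator per formant, and things like pitch and speed are just input parameters. All values here are illustrative, not DECTalk's actual algorithm or numbers.

```python
import math

def formant_synth(f0=120.0, formants=(730, 1090, 2440), dur=0.3, sr=16000, speed=1.0):
    """Toy formant-style vowel synthesizer (illustrative only).

    An impulse train at pitch f0 is passed through one two-pole resonator
    per formant frequency; 'speed' scales the duration, mimicking a rate knob.
    """
    n = int(round(dur / speed * sr))
    period = max(1, int(sr / f0))
    # impulse-train excitation, one pulse per pitch period
    src = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    out = [0.0] * n
    for fc in formants:
        bw = 80.0  # fixed formant bandwidth in Hz (illustrative)
        r = math.exp(-math.pi * bw / sr)
        a1 = 2.0 * r * math.cos(2.0 * math.pi * fc / sr)
        a2 = -r * r
        y1 = y2 = 0.0
        for i, x in enumerate(src):
            y = x + a1 * y1 + a2 * y2  # two-pole resonator
            out[i] += y
            y2, y1 = y1, y
    peak = max(abs(v) for v in out) or 1.0
    return [v / peak for v in out]  # normalized samples in [-1, 1]
```

Changing `f0`, `speed`, or the formant set changes the voice without touching any recorded samples, which is why these engines were so flexible.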

It would be interesting to hear the details, but what OpenAI seem to have done is build a neural-net-based speech synthesizer which is similarly flexible because it is generating the audio itself (not stitching together samples), conditioned on the voice ("Sky", etc.) it is meant to be mimicking. Dialing the emotion up/down basically affects the prosody and intonation. The singing is mostly extending vowel sounds and adding vibrato. In the demo Brockman refers to the "singing voice", so it's not clear if they can make any of the 5 (now 4!) voices sing.

In any case, it seems the audio is being generated by some such flexible tts, not just decoded from audio tokens generated by the model (which would anyway imply there was something - basically a tts - converting text tokens to audio tokens). They also used the same 5 voices in the previous ChatGPT, which wasn't claiming to be omni-modal, so it may be basically the same tts being used.
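The two architectures being contrasted here can be sketched as function compositions. All names below are hypothetical stand-ins, not OpenAI's actual components:

```python
def cascaded_pipeline(llm, tts, prompt):
    """Text-only LLM followed by a separate TTS component."""
    text = llm(prompt)   # the model only ever produces text
    return tts(text)     # a separate synthesizer renders the audio

def native_audio_pipeline(omni_model, codec_decoder, prompt):
    """Omni-modal model emits discrete audio tokens directly."""
    audio_tokens = omni_model(prompt)
    return codec_decoder(audio_tokens)  # a codec decoder turns tokens into a waveform
```

The commenter's argument is that the observable behavior (same five voices as the old, avowedly non-omni-modal ChatGPT) is consistent with the first pipeline.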
