[return to "Statement from Scarlett Johansson on the OpenAI "Sky" voice"]
1. HarHar+mg[view] [source] 2024-05-21 00:01:25
>>mjcl+(OP)
I found the whole ChatGPT-4o demo to be cringe-inducing. The fact that Altman was explicitly, and desperately, trying to copy "Her" at least makes it understandable why he didn't veto the bimbo persona: it's actually what he wanted. Great call by Scarlett Johansson in not wanting to be any part of it.

One thing these trained voices make clear is that it's a TTS engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model natively consumes and generates speech appears to be bunk.

◧◩
2. monroe+gl[view] [source] 2024-05-21 00:33:23
>>HarHar+mg
> One thing these trained voices make clear is that it's a TTS engine generating ChatGPT-4o's speech, same as before.

I'm not familiar with the specifics of how AI models work, but don't the abilities shown in some of the demos rule out what you've said above? E.g., the speeding up and slowing down of speech and the sarcasm don't seem possible if TTS were a separate component.

◧◩◪
3. nabaki+oE[view] [source] 2024-05-21 03:20:32
>>monroe+gl
Azure Speech TTS is capable of speeding up, slowing down, sarcasm, etc. with SSML. I wouldn't be surprised if it's what OpenAI is using on the backend.
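
For readers unfamiliar with SSML, here is a minimal sketch of how speaking rate can be marked up with the standard W3C `<prosody>` element, built in Python. The helper name `build_ssml` and the sample sentences are hypothetical and only for illustration. Nothing here reflects what OpenAI or Azure actually run, and expressive styles such as sarcasm would need vendor-specific extensions that are not shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(segments, lang="en-US"):
    """Build an SSML document from (text, rate) pairs.

    `rate` is an SSML prosody rate such as "slow", "medium", "fast" or "120%".
    Hypothetical helper: it only produces markup, it does not call any TTS service.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    for text, rate in segments:
        # Each segment gets its own <prosody> wrapper carrying the rate.
        prosody = ET.SubElement(speak, "prosody", {"rate": rate})
        prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("Oh, great.", "slow"),
        ("Another meeting that could have been an email.", "fast"),
    ]))
```

The resulting string would then be handed to whatever SSML-capable synthesizer is in use. Note that the speed change is specified in text ahead of time, which is the crux of the disagreement in the replies.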
◧◩◪◨
4. vessen+PE[view] [source] 2024-05-21 03:25:48
>>nabaki+oE
Greg has specifically said it's not an SSML-parsing text model; he's said it's an end to end multimodal model.

FWIW, I would find it very surprising if you could get the low-latency expressiveness, singing, harmonizing, sarcasm and interpretation of incoming voice through SSML -- that would be a couple of orders of magnitude better than any SSML product I've seen.
