I have no special insight into what they're actually doing, but speeding up and slowing down speech has been a feature of SSML for a long time (the prosody element's rate attribute). If their model is generating a similar markup language, what you're describing is plausible.
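For anyone who hasn't seen SSML, here is a rough sketch of what rate control looks like. It only builds the markup string; the helper name `ssml_with_rates` is made up for illustration, the bare `<speak>` wrapper is a simplification (real engines typically expect version and namespace attributes), and nothing here is a claim about what the product in question does internally.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_with_rates(segments):
    """Build a minimal SSML string from (text, rate) pairs.

    rate is None for default speed, or an SSML prosody rate value
    such as "x-slow", "slow", "medium", "fast", "x-fast".
    Hypothetical helper, for illustration only.
    """
    parts = []
    for text, rate in segments:
        body = escape(text)  # escape &, <, > so the text is XML-safe
        if rate is None:
            parts.append(body)
        else:
            # <prosody rate="..."> is the standard SSML way to change speed
            parts.append(f"<prosody rate={quoteattr(rate)}>{body}</prosody>")
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = ssml_with_rates([
    ("This part is read normally,", None),
    ("this part is read slowly,", "slow"),
    ("and this part is read quickly.", "fast"),
])
print(ssml)
```

A text-to-speech engine that accepts SSML would take a string like this as input; whether a given speech model consumes markup at all or infers pacing from plain text is exactly the open question.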
>>mmcwil+(OP)
It's also possible that any such enunciation is being hallucinated from the text by the speech model.
AI models exist to make up bullshit that fills a gap. When you have a conversation with any LLM it's merely autocompleting the next few lines of what it thinks is a movie script.