Just a couple of minutes of data, run through 10-20 minutes of training with RVC WebUI[0] on the included base model and then loaded into VC Client[1], gets you 90% of the way there. But that's a nearly year-old method, so I'm sure OAI has its own completely novel architecture for an extra 5% of fidelity.
0: https://github.com/RVC-Project/Retrieval-based-Voice-Convers...
1: https://github.com/w-okada/voice-changer