From a technical standpoint, a fine-tuned voice model can be built on top of an existing voice model with just a few minutes of data and GPU time, much like how artist LoRAs are built for image models. So it is entirely possible that this is what happened.
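
To give a rough sense of why a LoRA-style fine-tune is so cheap, here is a minimal pure-Python sketch of the low-rank update idea. It is only an illustration and not how any particular voice model is trained. The helper names (`matmul`, `lora_weight`, `param_counts`) and the layer sizes are made up for the example.

```python
def matmul(X, Y):
    """Plain nested-list matrix product X @ Y."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * (B @ A).

    W (d_out x d_in) stays frozen; only the small matrices
    A (r x d_in) and B (d_out x r) would be trained.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def param_counts(d_out, d_in, r):
    """Return (parameters in the full layer, parameters in the LoRA adapter)."""
    full = d_out * d_in
    lora = r * d_in + d_out * r
    return full, lora

# Hypothetical layer size and rank, chosen only for illustration.
full, lora = param_counts(1024, 1024, 8)
print(full, lora, lora / full)
```

With these assumed sizes the adapter holds 16,384 trainable values against 1,048,576 in the frozen layer, about 1.6%, which is part of why a small dataset and a short training run can be enough.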