>>mywitt+(OP)
One of the underlying models, PULSE, was trained on CelebA-HQ, which is likely why the results are mostly white-looking. StyleGAN, which was trained on the much more diverse (but sparse) FFHQ dataset, does come up with a much more diverse set of faces [1], but PULSE couldn't get them to converge very closely on the pixelated subjects, so they went with CelebA [2].
[1] https://github.com/NVlabs/stylegan
[2] https://arxiv.org/pdf/2003.03808.pdf (ctrl+f ffhq)