Your description is closer to how the open source CLIP+GAN models did it - if you ask for “tree” it starts growing the picture towards treeness until it’s all averagely tree-y rather than being “a picture of a single tree”.
It would be nice if asking for N samples got a diversity of traits you didn’t explicitly ask for. OpenAI seems to solve this by not letting you see it generate humans at all…