For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.
[0] https://gweb-research-imagen.appspot.com/main_gallery_images...
So: text -> text representation (an embedding) -> noised image space (most likely) -> iteratively reduce the noise N times -> upsample the result.
Something like that; please correct anything I'm missing.
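To make that chain concrete, here is a toy sketch of the same steps in plain Python. Everything in it is an assumption made for illustration: `encode_text`, `denoise_step`, `upsample`, and `generate` are hypothetical names, the "encoder" is just a seeded random vector, and the "denoiser" simply pulls pixels toward a text-dependent target instead of being a trained network. It only shows the control flow, not how a real model computes anything.

```python
import random

def encode_text(prompt, dim=8):
    # Hypothetical stand-in for a text encoder: a deterministic
    # pseudo-embedding derived from the prompt string.
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def denoise_step(image, emb, step, num_steps):
    # Hypothetical stand-in for the learned denoiser: move each pixel part
    # of the way toward a target that depends only on the text embedding.
    remaining = num_steps - step
    return [
        [px + (emb[(i + j) % len(emb)] - px) / remaining
         for j, px in enumerate(row)]
        for i, row in enumerate(image)
    ]

def upsample(image, scale):
    # Nearest-neighbor upsampling, standing in for a learned upsampling stage.
    out = []
    for row in image:
        wide = [px for px in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out

def generate(prompt, size=8, num_steps=10, scale=2, seed=0):
    emb = encode_text(prompt)                  # text -> text representation
    rng = random.Random(seed)
    image = [[rng.gauss(0.0, 1.0) for _ in range(size)]
             for _ in range(size)]             # start from pure noise
    for step in range(num_steps):              # iteratively reduce noise N times
        image = denoise_step(image, emb, step, num_steps)
    return upsample(image, scale)              # upsample the result

if __name__ == "__main__":
    result = generate("a snake made of corn")
    print(len(result), "x", len(result[0]))
```

The point of the sketch is that the text only enters through the embedding passed to every denoising step; there is no lookup of source images anywhere in the loop.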
Re: the snake corn question, it is mapping the "concept" of corn onto the "concept" of a body, with both concepts represented by intermediate learned vector representations.