Imagen, a text-to-image diffusion model

1. geonic+bn1 2022-05-24 09:30:40
>>kevema+(OP)
Can anybody give me a short, high-level explanation of how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing.

For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.

[0] https://gweb-research-imagen.appspot.com/main_gallery_images...

2. dave_s+eu1 2022-05-24 10:37:54
>>geonic+bn1
Well, first they parse the text into a high-level vector representation. Then they take images, add noise to them, and train a model to remove the noise, so that it can start from a noisy image and produce a clean one. Then they train a model to map from the text representation to the noisy-image representation of the corresponding image. Finally they upsample twice to get to good resolution.
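
In toy numpy pseudocode (the noise schedule, shapes, and stand-in model below are all made up for illustration; the real denoiser is a large text-conditional U-Net trained at scale), that training step is roughly:

    import numpy as np

    rng = np.random.default_rng(0)

    def toy_model(noisy, t, text_emb):
        # Stand-in for the learned conditional U-Net.
        return noisy * 0.5 + 0.001 * text_emb.mean()

    def training_step(image, text_emb, model, alpha_bar):
        # Corrupt the image with a known amount of noise, then score the
        # model on how well it predicts that noise (the DDPM objective).
        t = rng.integers(len(alpha_bar))            # random diffusion timestep
        eps = rng.standard_normal(image.shape)      # the noise we add
        a = alpha_bar[t]
        noisy = np.sqrt(a) * image + np.sqrt(1.0 - a) * eps  # forward (noising) process
        eps_hat = model(noisy, t, text_emb)         # model tries to recover the noise
        return np.mean((eps_hat - eps) ** 2)        # MSE loss to minimize

    alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 256))  # made-up schedule
    loss = training_step(rng.standard_normal((64, 64, 3)),
                         rng.standard_normal(128), toy_model, alpha_bar)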

So text -> text representation -> most likely noised image space -> iteratively reduce noise N times -> upsample result

Something like that; please correct anything I'm missing.
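
And the sampling side of that pipeline, again as a toy numpy sketch (DDPM-style ancestral sampling; the constants and the stand-in denoiser are invented, and I'm leaving out the two upsampling stages):

    import numpy as np

    rng = np.random.default_rng(0)

    def toy_denoiser(x, t, text_emb):
        # Stand-in for the trained text-conditional U-Net; returns a crude
        # noise estimate just so the loop below runs.
        return x * 0.9 + 0.001 * text_emb.mean()

    def sample(text_emb, denoiser, betas, size=64):
        # Start from pure noise and iteratively remove the noise the model
        # predicts, conditioned on the text embedding.
        alphas = 1.0 - betas
        alpha_bar = np.cumprod(alphas)
        x = rng.standard_normal((size, size, 3))    # pure Gaussian noise
        for t in reversed(range(len(betas))):
            eps_hat = denoiser(x, t, text_emb)      # predicted noise at step t
            x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) \
                / np.sqrt(alphas[t])
            if t > 0:                               # re-inject a bit of noise
                x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        return x  # 64x64 base image; Imagen then upsamples 64 -> 256 -> 1024

    betas = np.linspace(1e-4, 0.02, 256)            # made-up noise schedule
    base = sample(rng.standard_normal(128), toy_denoiser, betas)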

Re: the snake corn question, it is mapping the "concept" of corn onto the concept of a snake's body, as represented by the intermediate learned vector representations.
