zlacker

dave_s (OP), 2022-05-24 10:37:54
Well, first they encode the text into a high-level vector representation. Then they take images, add noise, and train a model to remove that noise, so it can start from a noisy image and produce a clear one. Then they train a model to map from the text representation to the noisy-image representation of the corresponding image. Finally, they upsample twice to reach a good resolution.
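To make the "add noise, then train a model to remove it" step concrete, here is a minimal sketch of how one training pair could be built. It assumes the common DDPM-style mixing rule (noisy = sqrt(alpha_bar) * clean + sqrt(1 - alpha_bar) * noise); the comment above does not say which noise schedule is used, and `add_noise`, `recover_clean`, and `alpha_bar` are names made up for illustration.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Mix a clean image (flat list of floats) with Gaussian noise.

    alpha_bar in (0, 1] is how much of the clean signal survives.
    Assumed DDPM-style rule, not confirmed by the original comment.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    noisy = [keep * x + spread * e for x, e in zip(pixels, noise)]
    # (noisy, noise) is one training pair: the model sees `noisy`
    # and is trained to predict `noise`.
    return noisy, noise


def recover_clean(noisy, predicted_noise, alpha_bar):
    """Invert the mixing rule using the model's noise estimate."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [(y - spread * e) / keep for y, e in zip(noisy, predicted_noise)]


rng = random.Random(0)
clean = [0.1, 0.5, 0.9]
noisy, noise = add_noise(clean, alpha_bar=0.7, rng=rng)
# With a perfect noise prediction the clean image comes back exactly
# (up to floating-point error); a real model only approximates this.
restored = recover_clean(noisy, noise, alpha_bar=0.7)
```

A trained denoiser would replace the known `noise` passed to `recover_clean` with its own estimate, which is why generation takes many small steps instead of one.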

So: text -> text representation -> most likely noisy-image representation -> iteratively reduce noise N times -> upsample the result
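The chain above can be sketched end to end as a toy program. Everything here is a stand-in, not the real system: `encode_text` is a hash-based fake for a learned text encoder, `denoise_step` nudges pixels toward a made-up text-conditioned target instead of running a trained network, and `upsample` is plain nearest-neighbour where a real pipeline would presumably use learned super-resolution models. Only the shape of the pipeline is meant to match.

```python
import random


def encode_text(prompt, dim=8):
    """Toy text encoder: bucket each word into a fixed-size count vector."""
    vec = [0.0] * dim
    for word in prompt.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def target_image(text_vec, size):
    """Hypothetical clean image a trained model would tie to this text."""
    n = len(text_vec)
    return [[text_vec[(r * size + c) % n] for c in range(size)]
            for r in range(size)]


def denoise_step(image, text_vec, strength=0.5):
    """Stand-in for the learned denoiser: move part-way to the target."""
    target = target_image(text_vec, len(image))
    return [[p + strength * (t - p) for p, t in zip(row, trow)]
            for row, trow in zip(image, target)]


def upsample(image, factor=2):
    """Nearest-neighbour upsampling: repeat each pixel factor x factor."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def generate(prompt, steps=10, size=4, seed=0):
    rng = random.Random(seed)
    text_vec = encode_text(prompt)                  # text -> representation
    image = [[rng.gauss(0.0, 1.0) for _ in range(size)]
             for _ in range(size)]                  # start from pure noise
    for _ in range(steps):                          # reduce noise N times
        image = denoise_step(image, text_vec)
    return upsample(upsample(image))                # upsample twice


picture = generate("a snake made of corn")  # 4x4 -> 8x8 -> 16x16 grid
```

Splitting generation from upsampling keeps the expensive iterative loop at low resolution, which is presumably why the pipeline is staged this way.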

Something like that, please correct anything I'm missing.

Re: the snake corn question: it is mapping the "concept" of corn onto the concept of a body, with both represented as intermediate learned vectors.
