Well, first they parse the text into a high-level vector representation. Then they take images, add noise, and train a model to remove the noise, so it can start from a noisy image and produce a clear one. Then they train a model to map from the text representation to the noisy-image representation of the corresponding image. Finally, they upsample twice to get to a good resolution.
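As a rough sketch of that add-noise/remove-noise training step (assuming a PyTorch-style setup; the tiny denoiser, the linear noise schedule, and the 1000 steps here are made-up stand-ins, not any particular model's real design):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real denoiser (actual models use large U-Nets).
class TinyDenoiser(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_image, t):
        # Broadcast the timestep as an extra channel so the net knows
        # how much noise it is looking at.
        t_map = t.view(-1, 1, 1, 1).float().expand(-1, 1, *noisy_image.shape[2:])
        return self.net(torch.cat([noisy_image, t_map], dim=1))

T = 1000                                    # number of noise levels (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # a common linear noise schedule
alpha_bar = torch.cumprod(1 - betas, dim=0)

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(clean_images):
    t = torch.randint(0, T, (clean_images.shape[0],))
    noise = torch.randn_like(clean_images)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    # Forward process: mix the clean image with Gaussian noise.
    noisy = ab.sqrt() * clean_images + (1 - ab).sqrt() * noise
    # The model learns to predict the noise that was added.
    loss = nn.functional.mse_loss(model(noisy, t), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(4, 3, 16, 16)))  # toy batch of 4 16x16 "images"
```

The key design point is that the model never learns to paint an image directly; it only learns to undo one step of corruption, and generation falls out of applying that undo repeatedly.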
So, to recap the pipeline: text -> text representation -> most likely point in noised image space -> iteratively reduce noise N times -> upsample the result.
Something like that; please correct anything I'm missing.
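Put in code, that inference path would look something like this (purely a sketch; encode_text, prior, denoiser, and upsampler are hypothetical stand-ins for the real learned components, wired up just so the flow runs end to end):

```python
import torch
import torch.nn.functional as F

# All four components below are fake placeholders for the real learned models.
def encode_text(prompt):                        # text -> text representation
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(1, 128)

def prior(text_emb):                            # text rep -> image embedding
    return text_emb                             # identity stand-in

def denoiser(x, t, img_emb):                    # predicts the noise left in x
    return 0.1 * x                              # toy: shrink toward zero

def upsampler(x):                               # 2x nearest-neighbour upsample
    return F.interpolate(x, scale_factor=2, mode="nearest")

def sample_image(prompt, steps=50):
    text_emb = encode_text(prompt)              # text -> text representation
    img_emb = prior(text_emb)                   # -> most likely image embedding
    x = torch.randn(1, 3, 64, 64)               # start from pure noise
    for t in reversed(range(steps)):            # iteratively reduce noise N times
        x = x - denoiser(x, t, img_emb)         # crude update; real samplers
                                                # follow a variance schedule
    return upsampler(upsampler(x))              # upsample twice: 64 -> 256

print(sample_image("a corn snake").shape)       # torch.Size([1, 3, 256, 256])
```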
Re: the snake corn question, it's mapping the "concept" of corn onto the concept of a body, as represented by the intermediate learned vector representations.
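A toy picture of that, if it helps (the vectors below are made up and tiny; real learned embeddings have hundreds or thousands of dimensions and aren't hand-crafted, but the geometry is the same idea):

```python
import torch
import torch.nn.functional as F

# Made-up 4-d "concept" vectors, purely for illustration.
snake_body   = torch.tensor([0.9, 0.1, 0.0, 0.2])  # long cylindrical body
corn_texture = torch.tensor([0.0, 0.8, 0.9, 0.1])  # yellow kernel surface

# A combined concept: the body vector carries the shape, the corn
# vector carries the surface appearance.
corn_snake = snake_body + corn_texture

for name, v in [("snake_body", snake_body), ("corn_texture", corn_texture)]:
    sim = F.cosine_similarity(corn_snake, v, dim=0)
    print(f"similarity(corn_snake, {name}) = {sim:.2f}")

# The combined vector stays close to both constituent concepts, which is
# roughly what lets the decoder render a body-shaped thing with corn texture.
```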