Imagen, a text-to-image diffusion model

>>kevema+(OP)
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".

>>hn_thr+3j
Figure A.4 in the linked paper is a good high level overview of this model. Shame it was hidden away on page 19 in the appendix!

Each box you see there has a section in the paper explaining it in more detail.

>>london+Zj
Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high-level) the first step of "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last 2 diffusion model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

zlacker