Imagen, a text-to-image diffusion model

>>kevema+(OP)
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".

>>hn_thr+3j
Figure A.4 in the linked paper is a good high level overview of this model. Shame it was hidden away on page 19 in the appendix!

Each box you see there has a section in the paper explaining it in more detail.

>>london+Zj
Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high-level) the first step of "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last 2 diffusion model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

>>hn_thr+Ym
> but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

It doesn't output it outright, it basically forms it slowly, finding and strengthening more and more finer-grained features among the dwindling noise, combining the learned associations of memorized convolutional texture primitives vs encoded text embeddings. In the limit of enough data the associations and primitives turn out composable enough to suffice for out-of-distribution benchmark scenes.

When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...

It works because it should. The next question is: "What are the implications?".

Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?

zlacker