zlacker

1. london+(OP) 2022-05-23 22:46:46
Figure A.4 in the linked paper gives a good high-level overview of this model. Shame it was hidden away on page 19 in the appendix!

Each box you see there has a section in the paper explaining it in more detail.
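
If it helps, the whole figure is really just four boxes wired in sequence. A rough Python sketch, with every function name made up purely to show the shape of the cascade:

    def generate(prompt):
        text_emb = frozen_text_encoder(prompt)           # frozen T5: prompt -> embedding vectors
        img64 = text_to_image_diffusion(text_emb)        # base model: 64x64 image from noise
        img256 = super_res_diffusion_1(img64, text_emb)  # diffusion upsampler: 64 -> 256
        return super_res_diffusion_2(img256, text_emb)   # diffusion upsampler: 256 -> 1024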

replies(1): >>hn_thr+Z2
2. hn_thr+Z2 2022-05-23 23:10:05
>>london+(OP)
Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high level) the first step, "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last two diffusion-model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".
replies(2): >>f38zf5+W4 >>sineno+P6
3. f38zf5+W4 2022-05-23 23:25:13
>>hn_thr+Z2
A good explanation is here:

https://www.youtube.com/watch?v=344w5h24-h8

4. sineno+P6 2022-05-23 23:41:20
>>hn_thr+Z2
> but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

It doesn't output it outright; it forms it gradually, finding and strengthening ever finer-grained features in the dwindling noise, combining the learned associations between memorized convolutional texture primitives and the encoded text embeddings. In the limit of enough data, the associations and primitives turn out composable enough to cover out-of-distribution benchmark scenes.
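
Concretely, the sampling loop is tiny. Here's a minimal DDPM-style sketch in PyTorch, assuming a trained noise-prediction network eps_model (a hypothetical stand-in for the big conditioned U-Net) and the standard linear noise schedule:

    import torch

    T = 1000                                   # number of denoising steps
    betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    def sample(eps_model, text_emb, shape=(1, 3, 64, 64)):
        x = torch.randn(shape)                 # start from pure Gaussian noise
        for t in reversed(range(T)):
            eps = eps_model(x, t, text_emb)    # predict the noise present at step t
            # subtract the predicted noise component (DDPM posterior mean)
            x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:                          # re-inject a little fresh noise
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x                               # the 64x64 image, formed step by step

The text embedding enters through eps_model at every step, which is what steers the strengthening toward "golden retriever in a beret" rather than any other image that could emerge from that noise.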

Once you have a high-quality encoder that maps your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (the model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...
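
And the encoder half is literally frozen: run once, with no gradients and no finetuning. A sketch using a small T5 from Hugging Face as a stand-in (the actual paper uses a much larger frozen T5-XXL):

    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    encoder = T5EncoderModel.from_pretrained("t5-small").eval()

    prompt = "a golden retriever wearing a blue checkered beret and red dotted turtleneck"
    tokens = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():                      # frozen: no gradients, never finetuned
        text_emb = encoder(**tokens).last_hidden_state    # (1, seq_len, 512)

    # text_emb is the compressed vector representation everything downstream
    # conditions on; the diffusion model never sees the raw text.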

It works because it should. The next question is: "What are the implications?"

Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?
