What I don't understand is how they do the composition. E.g., for "A giant cobra snake on a farm. The snake is made out of corn," I can see how it would reproduce the "giant cobra snake on a farm" part. What baffles me is how it accurately pictured the "made out of corn" part: I'm guessing it has never seen images of snakes made out of corn, yet it combined "snake" with "made out of corn" pretty much the way I imagined it would look.
b) It has seen images with descriptions of "corn," "cobra," "farm," and it has seen images of "A made out of B" and "C on a D." To generate a high-scoring image, it has to make something that scores well on all of them put together.
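A toy sketch of that "scores well on all of them put together" idea, using random vectors as stand-ins for learned image/text embeddings (the real model's embeddings and search procedure are nothing this simple; this is just my assumption of the shape of the objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical text embeddings, one per sub-description of the prompt.
text_vecs = {
    "cobra snake": rng.normal(size=64),
    "on a farm": rng.normal(size=64),
    "made out of corn": rng.normal(size=64),
}

def combined_score(image_vec):
    # The image has to score well on every description at once,
    # so the objective sums the per-description similarities.
    return sum(cosine(image_vec, t) for t in text_vecs.values())

# Crude hill-climb of a random "image embedding" toward the combined
# objective, standing in for the model's guided generation process.
img = rng.normal(size=64)
start = combined_score(img)
for _ in range(200):
    candidate = img + 0.1 * rng.normal(size=64)
    if combined_score(candidate) > combined_score(img):
        img = candidate
```

No single description forces "corn snake" images to exist in the training set; the composition falls out of satisfying all the constraints simultaneously.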
Convolutional filters lend themselves to a rich combinatorics of composition[1]: think of them as context-dependent texture-atoms, repelling and attracting each other as the local multi-dimensional context in the image varies. The composition is literally a convolutional transformation of local channels that encode related principal components of the context.
The astronomical amount of computation spent in training lets the network form a lego set of these texture-atoms across a general distribution of contexts.
At least, this is my intuition for how convnets work.
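To make the "texture-atom composition" idea concrete, here's a deliberately tiny illustration (my own toy construction, not the real network): two hand-made feature channels, a "snake body" and a "corn texture", combined by a 1x1 convolution whose bias makes the output fire only where both are active at once.

```python
import numpy as np

# Two hand-crafted 5x5 feature maps standing in for learned channels.
snake = np.zeros((5, 5))
snake[2, :] = 1.0            # a horizontal "snake body" activation
corn = np.zeros((5, 5))
corn[:, 1::2] = 1.0          # a periodic "corn kernel" texture

channels = np.stack([snake, corn])   # shape (2, 5, 5): channels-first
weights = np.array([1.0, 1.0])       # 1x1 conv: one weight per input channel
bias = -1.5                          # fires only where BOTH channels are on

# 1x1 convolution across channels, then ReLU: a composed "corn snake" map.
composed = np.maximum(0.0, np.tensordot(weights, channels, axes=1) + bias)
```

The composed channel is nonzero only on the snake row at the corn-textured columns: a new "snake made of corn" feature built purely by recombining existing channels, which is the kind of lego-set recombination I mean above.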
1. https://microscope.openai.com/models/contrastive_16x/image_b...