[return to "Imagen, a text-to-image diffusion model"]
1. hn_thr+3j 2022-05-23 22:40:22
>>kevema+(OP)
As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding of how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".

2. astran+ul 2022-05-23 22:59:17
>>hn_thr+3j
> I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing.

A basic part of it is that neural networks combine learning and memorizing fluidly inside them, and these networks are really, really big, so they can memorize a lot.

So when you see it reproduce a Shiba Inu well, don’t think of it as “the model understands Shiba Inus”. Think of it as making a collage out of some Shiba Inu clip art it found on the internet. You’d do the same if someone asked you to make this image.

It’s certainly impressive that the lighting and blending are as good as they are, though.

3. hn_thr+4o 2022-05-23 23:17:17
>>astran+ul
To be clear, I understand the general techniques: (a) how diffusion models can be used to upsample images and generate more photorealistic (or even "cartoon realistic") results, and (b) how they can do basic matching of "someone typed in Shiba Inu, look for images of Shiba Inus".

What I don't understand is how they do the composition. E.g. for "A giant cobra snake on a farm. The snake is made out of corn." I think I could understand how it could reproduce the "A giant cobra snake on a farm" part. What I don't understand is how it accurately pictured the "The snake is made out of corn." part, when I'm guessing it has never seen images of snakes made out of corn. The way it combined "snake" with "made out of corn", in a way that is pretty much how I imagined it would look, is the part I'm baffled by.

4. zone41+bu 2022-05-24 00:09:32
>>hn_thr+4o
a) Diffusion is not just used to upsample images but also to create them.

b) It has seen images with descriptions of "corn," "cobra," "farm," and it has seen images of "A made out of B" and "C on a D." To generate a high-scoring image, it has to make something that scores well on all of them put together.
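
To make (b) a bit more concrete, here is a toy sketch in Python. It is not how Imagen works internally; it only illustrates the idea of starting from noise and refining step by step toward something that scores well on several concepts at once. Everything in it is made up for illustration: the `CONCEPTS` table, the four-number "image", and the scoring function are all hypothetical stand-ins for what a real model learns from captioned images.

```python
import random

# Hypothetical "concepts": each one only cares about some features of the sample.
# In a real model these preferences are learned; here they are hard-coded.
CONCEPTS = {
    "cobra": {0: 1.0, 1: -1.0},  # feature index -> preferred value
    "corn": {2: 0.5, 3: 2.0},
}


def score(x, concept):
    """Higher is better: negative squared distance to the concept's preferred values."""
    return -sum((x[i] - v) ** 2 for i, v in CONCEPTS[concept].items())


def combined_score(x, prompt):
    """A prompt is a list of concepts; the sample must score well on all of them."""
    return sum(score(x, c) for c in prompt)


def combined_gradient(x, prompt):
    """Gradient of combined_score with respect to x."""
    grad = [0.0] * len(x)
    for c in prompt:
        for i, v in CONCEPTS[c].items():
            grad[i] += -2.0 * (x[i] - v)
    return grad


def generate(prompt, dim=4, steps=200, step_size=0.05, seed=0):
    """Start from pure noise and refine toward a high combined score."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(steps):
        # Injected noise shrinks to zero by the final step.
        noise_scale = 0.1 * (1.0 - (t + 1) / steps)
        g = combined_gradient(x, prompt)
        x = [
            xi + step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)
        ]
    return x


sample = generate(["cobra", "corn"])
print(sample, combined_score(sample, ["cobra", "corn"]))
```

The point of the toy is that no single training example needs to contain "cobra" and "corn" together: each concept constrains different aspects of the output, and the refinement loop settles on a sample that satisfies both.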
