zlacker

As someone who has a layman's understanding of neural networks, and who did some neural network programming ~20 years ago before the real explosion of the field, can someone point to some resources where I can get a better understanding about how this magic works?

I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing. Just looking for more information about how the software actually works, even if there are big chunks of it that are "this is beyond your understanding without taking some in-depth courses".

replies(3): >>london+W >>rvnx+Y >>astran+r2

>>hn_thr+(OP)
Figure A.4 in the linked paper is a good high level overview of this model. Shame it was hidden away on page 19 in the appendix!

Each box you see there has a section in the paper explaining it in more detail.

replies(1): >>hn_thr+V3

>>hn_thr+(OP)
Check https://github.com/multimodalart/majesty-diffusion or https://github.com/lucidrains/DALLE2-pytorch

There is a Google Colab workbook that you can try and run for free :)

This is the image-text pairs behind: https://laion.ai/laion-400-open-dataset/

>>hn_thr+(OP)
> I mean, from my perspective, the skill in these (and DALL-E's) image reproductions is truly astonishing.

A basic part of it is that neural networks combine learning and memorizing fluidly inside them, and these networks are really really big, so they can memorize stuff good.

So when you see it reproduce a Shiba Inu well, don’t think of it as “the model understands Shiba Inus”. Think of it as making a collage out of some Shiba Inu clip art it found on the internet. You’d do the same if someone asked you to make this image.

It’s certainly impressive that the lighting and blending are as good as they are though.

replies(2): >>hn_thr+15 >>Pheoni+rb

>>london+W
Uhh, yeah, I'm going to need much more of an ELI5 than that! Looking at Figure A.4, I understand (again, at a very high-level) the first step of "Frozen Text Encoder", and I have a decent understanding of the upsampling techniques used in the last 2 diffusion model steps, but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

replies(2): >>f38zf5+S5 >>sineno+L7

>>astran+r2
To be clear, I understand the general techniques about (a) how diffusion models can be used to upsample images and generate more photorealistic (or even "cartoon realistic") results and (b) I understand how they can do basic matching of "someone typed in Shiba Inu, look for images of Shiba Inus".

What I don't understand is how they do the composition. E.g. for "A giant cobra snake on a farm. The snake is made out of corn." I think I could understand how it could reproduce the "A giant cobra snake on a farm" part. What I don't understand is how it accurately pictured "The snake is made out of corn." part, when I'm guessing it has never seen images of snakes made out of corn, and the way it combined "snake" with "made out of corn", in a way that is pretty much how I imagined it would look, is the part I'm baffled by.

replies(2): >>zone41+8b >>sineno+ci

>>hn_thr+V3
A good explanation is here.

https://www.youtube.com/watch?v=344w5h24-h8

>>hn_thr+V3
> but the middle "Text-to-Image Diffusion Model" step that magically outputs a 64x64 pixel image of an actual golden retriever wearing an actual blue checkered beret and red-dotted turtleneck is where I go "WTF??".

It doesn't output it outright, it basically forms it slowly, finding and strengthening more and more finer-grained features among the dwindling noise, combining the learned associations of memorized convolutional texture primitives vs encoded text embeddings. In the limit of enough data the associations and primitives turn out composable enough to suffice for out-of-distribution benchmark scenes.

When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...

It works because it should. The next question is: "What are the implications?".

Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?

>>hn_thr+15
a) Diffusion is not just used to upsample images but also to create them.

b) It has seen images with descriptions of "corn," "cobra," "farm," and it has seen images of "A made out of B" and "C on a D." To generate a high-scoring image, it has to make something that scores well on all of them put together.

>>astran+r2
> these networks are really really big, so they can memorize stuff good.

People tend to really underestimate just how big these models are. Of course these models aren't simply "really really big" MLPs, but the cleverness of the techniques used to build them is only useful at insanely large scale.

I do find these models impressive as examples of "here's what the limit of insane amounts of data, insane amounts of compute can achieve with some matrix multiplication". But at the same time, that's all they are.

What saddens me about the rise of deep neural networks is it is really is the end of the era of true hackers. You can't reproduce this at home. You can't afford to reproduce this one in the cloud with any reasonable amount of funding. If you want to build this stuff your best bet is to go to top tier school, make the right connections and get hired by a mega-corp.

But the real tragedy here is that the output of this is honestly only interesting it if it's the work of some hacker fiddling around in their spare time. A couple of friend hacking in their garage making images of raccoon painting is pretty cool. One of the most powerful, well funded, owners of the likely the most compute resources on the planet doing this as their crowning achievement in AI... is depressing.

replies(2): >>rland+6e >>sineno+xh

>>Pheoni+rb
The hackers will not be far behind. You can run some of the v1 diffusion models on a local machine.

I think it's fair to say that this is the way it's always been. In 1990, you couldn't hack on an accurate fluid simulation at home, you needed to be at a university or research lab with access to a big cluster. But then, 10 years later, you could do it on a home PC. And then, 10 years after that, you could do it in a browser on the internet.

It's the same with this AI stuff.

I think if we weren't in the midst of this unique GPU supply crunch, the price of a used 1070 would be about $100 right now -- such a card would be state of the art 10 years ago!

replies(1): >>ChadNa+kk

>>Pheoni+rb
Some cutting-edge stuff is still being made by talented hackers using nothing but a rig of 8x 3090s: https://github.com/neonbjb/tortoise-tts

Other funding models are possible as well, in the grand scheme of things the price for these models is small enough.

>>hn_thr+15
> What I don't understand is how they do the composition

Convolutional filters lend themselves to rich combinatorics of compositions[1]: think of them as of context-dependent texture-atoms, repulsing and attracting over the variations of the local multi-dimensional context in the image. The composition is literally a convolutional transformation of local channels encoding related principal components of context.

Astronomical amounts of computations spent via training allow the network to form a lego-set of these texture-atoms in a general distribution of contexts.

At least this is my intuition for the nature of the convnets.

1. https://microscope.openai.com/models/contrastive_16x/image_b...

>>rland+6e
And the supply crunch is getting better (you can buy an RTZ 3080 at MSRP now!) and technological progress doesn't seem to be slowing down. If the rumors are to be believed, a 4090 will be close to twice as fast as a 3090.