zlacker

[parent] [thread] 5 comments
1. visarg+(OP)[view] [source] 2022-05-23 21:23:30
Interesting discovery they made

> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.

There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.

replies(2): >>gwern+0v >>ravi-d+XD
2. gwern+0v[view] [source] 2022-05-24 01:06:15
>>visarg+(OP)
I think that's unsurprising. With DALL-E 1, for example, scaling the VAE (the image model generating the actual pixels) hits very fast diminishing returns, and all your compute goes into the 'text encoder' generating the token sequence.

Particularly as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & control of the prompt to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help much: if the input seed is incorrect and doesn't correctly encode a thumbnail sketch of how all these animals ought to be engaging in outdoor sports, there's nothing some low-level pixel-munging neurons can do to fix it.
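The two-stage split described above can be sketched roughly like this. This is a toy illustration of the DALL-E 1-style pipeline only, not real model code: the transformer and codebook are random stand-ins, and all names and sizes are assumptions chosen for shape clarity. The point is where the compute lives: the autoregressive "text encoder" does the heavy semantic work of choosing the token grid, while the VAE decoder just renders tokens as pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GRID, PATCH = 8192, 32, 8  # dVAE codebook size, 32x32 token grid, 8x8 patches

def transformer_sample(prompt):
    # Stand-in for the heavy stage: an autoregressive transformer that,
    # conditioned on the prompt, emits a grid of discrete image tokens.
    # This is where scaling pays off -- the semantic layout lives here.
    return rng.integers(0, VOCAB, size=(GRID, GRID))

# Stand-in decoder lookup table: one 8x8 RGB patch per codebook entry.
codebook = rng.normal(size=(VOCAB, 3 * PATCH * PATCH))

def vae_decode(tokens):
    # Stand-in for the light stage: deterministically map each token to
    # its pixel patch and tile the patches into a full image.
    patches = codebook[tokens].reshape(GRID, GRID, PATCH, PATCH, 3)
    return patches.transpose(0, 2, 1, 3, 4).reshape(GRID * PATCH, GRID * PATCH, 3)

image = vae_decode(transformer_sample("a donkey jumping rope"))
print(image.shape)  # (256, 256, 3)
```

If the token grid already encodes the wrong scene layout, no amount of extra capacity in `vae_decode` can recover it, which is the diminishing-returns argument above.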

replies(1): >>visarg+Dt2
3. ravi-d+XD[view] [source] 2022-05-24 02:33:34
>>visarg+(OP)
Basically makes sense, no? DALLE-2 suffered from misunderstanding propositional logic, treating prompts as less structured than it should have. That's a text-model issue! Compared to that, scaling up the image model isn't as important (especially with a few passes).
replies(1): >>espadr+wr1
4. espadr+wr1[view] [source] [discussion] 2022-05-24 11:05:24
>>ravi-d+XD
Is there a way to confirm that this extra processing relates to the language structure, and not the processing of concepts?

I wouldn’t be surprised if image-only training, lacking video and 3D understanding, fails to capture things like the fear of heights, so that the concept of gravity ends up being learned in the text-processing weights.

replies(1): >>visarg+pu2
5. visarg+Dt2[view] [source] [discussion] 2022-05-24 16:46:16
>>gwern+0v
I was thinking more about our traditional ResNet50 trained on ImageNet vs CLIP. ResNet was limited to a thousand classes and brittle. CLIP can generalise to new concept combinations with ease. That changes the game, and the jump is based on NLP.
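The contrast being drawn is between a fixed 1000-way softmax head and CLIP's open-vocabulary scoring. A minimal sketch of the CLIP-style mechanism, with random stand-in embeddings in place of the real encoders (the prompts, dimensions, and `normalize` helper are illustrative assumptions, not CLIP's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Any prompt set works: adding a new class needs no retraining, just new text.
prompts = ["a photo of a cat", "a photo of a dog", "a donkey jumping rope"]
text_emb = normalize(rng.normal(size=(len(prompts), 512)))  # stand-in text encoder
image_emb = normalize(rng.normal(size=(512,)))              # stand-in image encoder

logits = text_emb @ image_emb                 # cosine similarity per prompt
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the prompt set
print(prompts[int(np.argmax(probs))])
```

A ResNet50 head is frozen at its training classes; here the "classifier" is just whatever text you embed, which is why the jump rides on the NLP side.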
6. visarg+pu2[view] [source] [discussion] 2022-05-24 16:49:56
>>espadr+wr1
I am sure the image-text-video-audio-games model will come soon. The recent Gato was one step in that direction. There's so much video content out there, it begs for modelling. I think robotics applications will benefit the most from video.