zlacker

I think that's unsurprising. With DALL-E 1, for example, scaling the VAE (the image model generating the actual pixels) hits very fast diminishing returns, and all your compute goes into the 'text encoder' generating the token sequence.

Particularly as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & control of the prompt to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help much: if the input seed is incorrect and doesn't correctly encode a thumbnail sketch of how all these animals ought to be engaging in outdoors sports, there's nothing some low-level pixel-munging neurons can do to help much.

replies(1): >>visarg+DY1

>>gwern+(OP)
I was thinking more about our traditional ResNet50 trained on ImageNet vs CLIP. ResNet was limited to a thousand classes and brittle. CLIP can generalise to new concept combinations with ease. That changes the game, and the jump is based on NLP.