> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.
There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.
Particularly as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & prompt control to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help: if the text encoding is wrong and doesn't correctly capture a thumbnail sketch of how all these animals ought to be engaging in outdoor sports, there's nothing the low-level pixel-munging neurons can do to fix it.
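To make that intuition concrete, here is a minimal sketch (not Imagen's actual code; the dimensions, token counts, and class name are made up) of how text conditioning typically enters a diffusion model via cross-attention. The only channel through which the pixel-generating network "knows" about donkeys, ropes, and octopuses is the frozen text encoder's output tokens, so if those embeddings don't encode the scene, extra pixel-level capacity has nothing to work with:

```python
# Minimal illustrative sketch, assuming a standard cross-attention conditioning
# scheme (as used in many text-to-image diffusion models); not the paper's code.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, img_dim: int, txt_dim: int, heads: int = 8):
        super().__init__()
        # Image tokens are queries; keys/values come from the text encoder.
        self.attn = nn.MultiheadAttention(img_dim, heads, kdim=txt_dim,
                                          vdim=txt_dim, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Everything the generator learns about the prompt flows through
        # txt_tokens here; bad text embeddings cannot be repaired downstream.
        attended, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + attended)

# Hypothetical shapes: a 64x64 latent grid flattened to 4096 image tokens,
# 77 text tokens from a frozen text encoder (e.g. a T5-style model).
block = CrossAttentionBlock(img_dim=512, txt_dim=1024)
img = torch.randn(1, 4096, 512)
txt = torch.randn(1, 77, 1024)
out = block(img, txt)  # -> shape (1, 4096, 512)
```

Under this view, scaling the text encoder improves the quality of `txt_tokens` (the scene "blueprint"), while scaling the diffusion U-Net mostly improves how faithfully that blueprint gets rendered into pixels.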
I wouldn’t be surprised if, because the image training data lacks video and 3D understanding, the model fails to pick up things like the fear of heights, and the concept of gravity ends up being learned in the text-processing weights instead.