> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.
There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.
I wouldn’t be surprised if the lack of video and 3D understanding in the image dataset training fails to understand things like the fear of heights, and the concept of gravity ends up being learned in the text processing weights.