> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.
There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.