zlacker

[return to "Imagen, a text-to-image diffusion model"]

>>kevema+(OP)
Interesting discovery they made

> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.

There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.

>>visarg+J5
Basically makes sense, no? DALLE-2 suffered from misunderstanding propositional logic, treating prompts as less structured than it should have. That's a text model issue! Compared to that, scaling up the image isn't as important (especially with a few passes).
