Imagen, a text-to-image diffusion model

>>kevema+(OP)
Interesting discovery they made:

> We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.

There seems to be an unexpected level of synergy between text and vision models. Can't wait to see what video and audio modalities will add to the mix.

>>visarg+J5
Basically makes sense, no? DALLE-2 suffered from misunderstanding propositional logic, treating prompts as less structured than it should have. That's a text model issue! Compared to that, scaling up the image isn't as important (especially with a few passes).

>>ravi-d+GJ
Is there a way to confirm that this extra processing relates to the language structure, and not the processing of concepts?

I wouldn’t be surprised if, lacking video and 3D understanding, training on the image dataset fails to capture things like the fear of heights, so that a concept like gravity ends up being learned in the text processing weights.

>>espadr+fx1
I am sure the image-text-video-audio-games model will come soon. The recent Gato was one step in that direction. There's so much video content out there, it begs for modelling. I think robotics applications will benefit the most from video.
