zlacker

[return to "Imagen, a text-to-image diffusion model"]
1. Veedra+7w[view] [source] 2022-05-24 00:25:12
>>kevema+(OP)
I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I am meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on, you don't get to just throw T5 into a multimodal model as-is and have it work better than multimodal transformers! VLM[1] at least added fine-tuned internal components.

Good lord, we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.

[1] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a...

2. skybri+HC[view] [source] 2022-05-24 01:23:31
>>Veedra+7w
I think it's something like a very intelligent Borgesian Library of Babel. There are all sorts of books in there, by authors with conflicting opinions and styles, owing to the source material. The librarian is very good at giving you something you want to read, but that doesn't mean it has coherent opinions. It doesn't know or care what's authentic and what's a forgery. It's great for entertainment, but you wouldn't want to do research there.

For image generation, it's obviously all fiction. Which is fine and mostly harmless if you know what you're getting. It's going to leak out onto the Internet, though, and there will be photos that get passed around as real.

For text, it's all fiction too, but this isn't obvious to everyone because sometimes it's based on real facts. There's often not going to be an obvious place where the facts stop and the fiction starts.

The raw Internet is going to turn into a mountain of this stuff. Authenticating information is going to become a lot more important.
