[return to "Imagen, a text-to-image diffusion model"]
1. Veedra+7w 2022-05-24 00:25:12
>>kevema+(OP)
I thought I was doing well after not being overly surprised by DALL-E 2 or Gato. How am I still not calibrated on this stuff? I know I'm meant to be the one who constantly argues that language models already have sophisticated semantic understanding, and that you don't need visual senses to learn grounded world knowledge of this sort, but come on: you don't get to just drop a frozen, off-the-shelf T5 into a multimodal model as-is and have it outperform purpose-built multimodal transformers! VLM [1] at least added fine-tuned internal components.

Good lord, we are screwed. And yet somehow I bet even this isn't going to kill off the "they're just statistical interpolators" meme.

[1] https://www.deepmind.com/blog/tackling-multiple-tasks-with-a...
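For anyone who hasn't read the paper: "as-is" really does mean as-is. Imagen runs the prompt through a frozen, pre-trained T5-XXL encoder and has the diffusion model cross-attend to the resulting embeddings; the text tower is never fine-tuned. A minimal sketch of that conditioning pattern, using Hugging Face Transformers (t5-small and the prompt are illustrative stand-ins, not Imagen's actual setup):

    # Sketch of Imagen-style text conditioning: a frozen, pre-trained T5
    # encoder produces embeddings that the diffusion model conditions on.
    # t5-small stands in for Imagen's T5-XXL; the prompt is made up.
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    encoder = T5EncoderModel.from_pretrained("t5-small")
    encoder.requires_grad_(False)  # frozen: the text tower is never trained
    encoder.eval()

    batch = tokenizer(["a corgi riding a bike in times square"],
                      return_tensors="pt", padding=True)
    with torch.no_grad():
        cond = encoder(**batch).last_hidden_state  # (batch, seq, d_model)

    # A diffusion U-Net would take `cond` as its cross-attention context at
    # every denoising step; only the image model gets gradient updates.

The whole surprise is that this works: all the "multimodal" machinery lives on the image side, and the text representations come from a model that has never seen a pixel.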

2. hooand+UN 2022-05-24 03:22:57
>>Veedra+7w
I haven't been overly surprised by any of it. The end product is still the same kind of thing, no matter how much they scale it up.

All of these models seem to require a human to evaluate and edit the results, even Copilot. In theory this will reduce the number of human hours required to write text or create images, but I haven't seen anyone doing that successfully at scale or solving the associated problems yet.

I'm pessimistic about the current state of AI research. It seems like it's been more of the same for many years now.
