Google Imagen 2

>>geox+(OP)
Kinda scratching my head at the purpose of the prompt understanding examples they show off. From previous papers I've seen in the space, shouldn't they be trying various compositional things like "A blue cube next to a red sphere" and variations thereof?

Instead they use

>The robin flew from his swinging spray of ivy on to the top of the wall and he opened his beak and sang a loud, lovely trill, merely to show off. Nothing in the world is quite as adorably lovely as a robin when he shows off - and they are nearly always doing it.

And the result they show off is a photograph of a robin. Cool. SDXL[0] can do the exact same thing given the same prompt; in fact, even SD1.5 can do it easily[1].

[0] https://i.imgur.com/rsgtYbf.png

[1] https://i.imgur.com/1rcQpcQ.png
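
For what it's worth, the kind of compositional test set I mean ("A blue cube next to a red sphere" and variations) is cheap to generate. This is only a rough sketch: the color, shape, and relation lists and the function name are my own placeholders, not taken from any paper or from the Imagen 2 page.

```python
from itertools import permutations, product

# Hypothetical vocabularies, chosen only for illustration.
COLORS = ["blue", "red"]
SHAPES = ["cube", "sphere"]
RELATIONS = ["next to", "on top of", "behind"]


def compositional_prompts(colors, shapes, relations):
    """Build prompts that pair two objects with distinct colors and shapes."""
    prompts = []
    for (c1, c2), (s1, s2), rel in product(
        permutations(colors, 2), permutations(shapes, 2), relations
    ):
        # Swapping colors or shapes checks attribute binding, not just object presence.
        prompts.append(f"A {c1} {s1} {rel} a {c2} {s2}")
    return prompts


prompts = compositional_prompts(COLORS, SHAPES, RELATIONS)
for p in prompts:
    print(p)
```

Feeding each line to a generator and checking whether the colors end up on the right objects would say a lot more about prompt understanding than a single robin photo.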

>>Jackso+wb1
I've developed two tests for AI image generators to see if they've actually advanced to "the next level". Take literally any AI image generator and give it one of these prompts:

"A flying squirrel gliding between trees": It won't be able to do it. Just telling it "flying squirrel" will often generate squirrels with bat wings coming off their backs.

Ahh, but that's just a tiny, specific thing missing from the data set! Surely that'll get fixed eventually as they add more training data...

"A fox girl hugging a bunny girl hugging a cat girl": The only way to make this work is with fancy stuff like Segment Anything (SAM) working with Stable Diffusion. An alternative prompt for the same thing:

"A fox girl and a bunny girl and a cat girl all hugging each other"

It's such a simple thing; generative AI can make three people hugging each other no problem. However, trying to get it to generate three different types of people in the same scene is really, really hard and largely dependent on luck.
