zlacker

[parent] [thread] 9 comments
1. apsec1+(OP)[view] [source] 2023-12-13 15:39:28
This would have been an epic release two years ago, but there are now many well-established models in this area (DALL-E, Midjourney, Stable Diffusion). It would be great to see some comparisons or benchmarks to show Imagen 2 is a better alternative. As it stands, it's hard for me to tell if this is worth switching to.
replies(3): >>chanks+o1 >>Mashim+J2 >>larodi+kD2
2. chanks+o1[view] [source] 2023-12-13 15:44:57
>>apsec1+(OP)
Right? This page looks like basically every other generative image AI announcement page, as well as basically every model page. They show a bunch of cherry-picked examples that are still only "pretty good" relative to the rest of the industry (it's incredible tech compared to something like DeepDream) and give you nothing to really differentiate it.
3. Mashim+J2[view] [source] 2023-12-13 15:49:52
>>apsec1+(OP)
> it's hard for me to tell

I can only compare it to Stable Diffusion. But Imagen 2 seems significantly more advanced.

Try to do anything with text in SDXL. It's not easy, and it often messes up. I don't think you can get a clean logo with multiple text areas out of SDXL.
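
For example, a minimal repro with the diffusers SDXL pipeline (a sketch, untested; the prompt is just an illustration of the multi-text-area case):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # A logo prompt with two text areas; SDXL typically garbles one or both.
    prompt = 'minimal logo, headline "IMAGEN" with the tagline "text to image" underneath'
    pipe(prompt).images[0].save("logo.png")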

Look at the prompt and image of the robin. That is mighty impressive.

replies(3): >>Ologn+U3 >>averev+r4 >>nabaki+jb
4. Ologn+U3[view] [source] [discussion] 2023-12-13 15:54:41
>>Mashim+J2
Stability AI has gaps in SDXL for text, but they seem to do a better job with Deep Floyd ( https://github.com/deep-floyd/IF ). I have done a lot of interesting text things with Deep Floyd.
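
Roughly, via the diffusers pipeline (a sketch, untested, assuming you've accepted the DeepFloyd license on the Hub; this is stage I only, which produces 64x64 images that the separate super-resolution stages then upscale):

    import torch
    from diffusers import DiffusionPipeline

    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    stage_1.enable_model_cpu_offload()  # swap idle sub-models to CPU to save VRAM

    # T5 prompt encoding happens once, separately from the diffusion loop
    prompt_embeds, negative_embeds = stage_1.encode_prompt('a neon sign that reads "OPEN 24/7"')
    image = stage_1(prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds).images[0]
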
replies(1): >>Mashim+z5
5. averev+r4[view] [source] [discussion] 2023-12-13 15:56:48
>>Mashim+J2
Yeah, Stable Diffusion has a very limited understanding of composition instructions. You can reliably get things drawn, but it's super hard to get a specific thing in a specific place (e.g. "a man with blonde hair near a girl with black hair" is going to assign hair color more or less randomly, and there's no guarantee of how many people will be in the picture). Regional prompting and ControlNet somewhat help, but regional prompting is very unreliable, and ControlNet is, well, not text-to-image.
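
(For the record, with diffusers it looks something like the sketch below - untested, and the edge map is a hypothetical pre-computed input. It also shows why ControlNet isn't text-to-image: the layout comes from the conditioning image, not from the prompt.)

    import torch
    from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
    )
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    canny = load_image("layout_edges.png")  # hypothetical Canny edge map that fixes the layout
    image = pipe("a man with blonde hair near a girl with black hair",
                 image=canny).images[0]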

DALL-E 3 gets things right most of the time.

6. Mashim+z5[view] [source] [discussion] 2023-12-13 15:59:57
>>Ologn+U3
Looks good. But 24 GB of VRAM is quite a lot for 1024x1024.
replies(1): >>orbita+P7
7. orbita+P7[view] [source] [discussion] 2023-12-13 16:09:23
>>Mashim+z5
This is a pixel diffusion model that doesn't use latent-space encoding, hence the memory requirements. Besides, good prompt understanding requires a large transformer for text encoding, usually far larger than the image-generation part; DeepFloyd IF uses T5.
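
If the 24 GB is a problem, the diffusers docs describe a memory-saving pattern: load the T5 encoder in 8-bit (needs bitsandbytes and accelerate) and run prompt encoding separately from the UNet. A sketch, untested:

    from transformers import T5EncoderModel
    from diffusers import DiffusionPipeline

    # 8-bit T5 encoder; an "8bit" weight variant ships in the DeepFloyd repo
    text_encoder = T5EncoderModel.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder",
        device_map="auto", load_in_8bit=True, variant="8bit"
    )
    # Pipeline with unet=None is used for prompt encoding only
    pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None
    )
    prompt_embeds, negative_embeds = pipe.encode_prompt("a robin perched on a branch")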

You can use Harrlogos XL to produce text with SDXL, although it's mostly limited to short captions and logos. The other route (ControlNets) is more involved, but actually useful.
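
Harrlogos XL is a LoRA, so it's one extra line on top of the usual SDXL pipeline (a sketch; the filename is hypothetical - point it at wherever you downloaded the actual .safetensors):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    # hypothetical local copy of the Harrlogos XL LoRA weights
    pipe.load_lora_weights(".", weight_name="Harrlogos_XL.safetensors")
    image = pipe('"GAME OVER" text logo, graffiti style').images[0]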

8. nabaki+jb[view] [source] [discussion] 2023-12-13 16:25:25
>>Mashim+J2
> I can only compare it to Stable Diffusion. But Imagen 2 seems significantly more advanced.

I wouldn't say that until we're able to try it for ourselves. As we know, Google is prone to severe cherry-picking and deceptive marketing.

replies(1): >>quitit+vM
9. quitit+vM[view] [source] [discussion] 2023-12-13 18:33:27
>>nabaki+jb
Google has this habit of releasing concept videos but presenting them as product demos.

Overselling is not a winning strategy, especially when others are shipping genuinely good products.

Every time Google shows off something new, the first thing people now ask is which part Google faked (or how extreme the cherry-picking was).

10. larodi+kD2[view] [source] 2023-12-14 08:26:30
>>apsec1+(OP)
I was going to say pretty much the same - the obvious - while also adding insult to injury: with the recent announcements of the last few weeks, it seems that Google desperately needs to shine in the world of AI, but fails to do so (despite 2000+ votes for the new Bard, which is still not so good).

Now, from a designer's perspective, honestly, I don't care too much who the provider of the image is, since one will have to work more on it anyway. So designers, illustrators, etc. are not the target for such platforms, even though that seems counter-intuitive. If you ask me which system was the source for an image used in a poster in the last 12 months... well, I may remember, but it is not of paramount importance to the end result. After a year of active usage of DALL-E 2/3, SDXL and Midjourney (which is also SD of some sort), I can confidently state that there is much more work to do, and a lot of prompting, to actually get something unique and worth using. Sadly, the time this takes is comparable to working with an actual real artist. Of course, the latter is likely to be hit by this new innovation, but perhaps not so much.

From the perspective of someone integrating text-to-image - which is yet to be seen done in a reasonable manner, e.g. for a quest game with generative images - the API's flexibility and cost would be the most important qualifiers. Even then it may actually be better to run SD/XL yourself. From a cost perspective, all these services are still far too pricey to be used for anything more serious than a few one-shot images.
