Why bother using a product from a company notorious for failing to commit to most of its services, when you can run something that produces output that's pretty close (and maybe better) and is free to run, change, and train?
Stable Diffusion is the Linux-on-the-desktop of diffusion models IMO
(I agree w/ your comment on trusting Google - pretty sure they'll just phase this out eventually anyway, so I wouldn't bother trying it)
Because it costs $0.02 per image instead of $1000 on a graphics card and endless buggering around to set up.
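For what it's worth, the break-even math is easy to run. A back-of-envelope sketch in Python, taking both figures above at face value:

    # Assumes ~$0.02/image hosted vs a one-time ~$1000 GPU purchase
    gpu_cost = 1000.00
    hosted_cost_per_image = 0.02
    break_even = gpu_cost / hosted_cost_per_image
    print(f"GPU pays for itself after {break_even:,.0f} images")  # 50,000 images

So unless you're generating tens of thousands of images (or you value the control), the hosted price is hard to argue with.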
Linux entered the market at a time when paid alternatives were fully established and concentrated, having served users and companies for years until they were used to working with them. No paid txt2img offering comes anywhere close to that kind of market dominance for image generation. They don't offer anything that isn't available with free alternatives (they actually offer less) and are highly restrictive in comparison. Anyone doing anything beyond building disguised DALLE/Imagen clients has absolutely no incentive to use a paid service.
*it also takes like 15 mins to set up (this includes loading the models).
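If you go through Hugging Face's diffusers library instead of a UI, the whole setup really is a handful of lines (a minimal sketch; the model name is just one common choice):

    # pip install torch diffusers transformers accelerate
    import torch
    from diffusers import StableDiffusionPipeline

    # Weights download on the first run; later runs load from the local cache
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe("a watercolor fox in a snowy forest").images[0]
    image.save("fox.png")

Most of those 15 minutes go to the initial model download.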
This makes the image much more usable without editing.
(DALL-E pretends to do that, but it's actually just using GPT-4 Vision to create a description of the image and then prompting based on that.)
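If that's accurate, the round trip would look something like this sketch with the OpenAI Python client (model names and prompts are illustrative, not confirmed DALL-E internals):

    from openai import OpenAI

    client = OpenAI()

    # Step 1: a vision model writes a detailed description of the source image
    description = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative choice of vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in exhaustive detail."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: regenerate from scratch using the description plus the edit;
    # no pixel from the original image actually survives this round trip
    result = client.images.generate(
        model="dall-e-3",
        prompt=description + " Now make the cat wear a top hat.",
    )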
Live editing tools like https://drawfast.tldraw.com/ are increasingly being built on top of Stable Diffusion, and are far and away the most interesting way to interact with image generation models. You can't build that on DALL-E 3.
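The trick behind those live tools is img2img at a very low step count. With diffusers plus an LCM-LoRA you can get per-frame latency down to canvas speed (a sketch of the general technique, not drawfast's actual stack):

    import torch
    from PIL import Image
    from diffusers import AutoPipelineForImage2Image, LCMScheduler

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

    sketch = Image.open("canvas.png").convert("RGB")  # whatever the user just drew
    frame = pipe(
        "an oil painting of a lighthouse",
        image=sketch,
        num_inference_steps=4,  # LCM makes single-digit step counts usable
        strength=0.5,
        guidance_scale=1.0,
    ).images[0]

Re-run that on every canvas change and you get the drawfast-style feedback loop.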
Still, Stable Diffusion is losing the usability, tooling, and integration game. The people who care to make interfaces for it mostly treat it as an expert tool, not something for people who have never heard of image-generating AI. Many competing services have better out-of-the-box results (for people who don't know what a negative prompt is), easier hosting, user-friendly integrations in tools that matter, better hosted services, etc.
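(For anyone in that boat: a negative prompt is just a second prompt the sampler steers away from. In diffusers it's one extra argument, assuming a `pipe` like the one set up earlier:)

    image = pipe(
        "portrait photo of an astronaut",
        negative_prompt="blurry, low quality, watermark, extra fingers",
    ).images[0]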
I guess that turns out to be not as important for end users as you'd think.
Anyway, DeepFloyd/IF has great comprehension. It should be straightforward to improve that in Stable Diffusion; I can't tell you exactly why they haven't tried it.
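(For the curious: IF's comprehension largely comes from its big T5 text encoder. You can try stage I yourself through diffusers; a sketch, assuming you've accepted the gated license on Hugging Face:)

    import torch
    from diffusers import DiffusionPipeline

    # Stage I of three: it outputs 64x64 images that stages II/III upscale
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    stage_1.enable_model_cpu_offload()  # the T5 encoder alone is enormous

    image = stage_1("a neon sign that says OPEN").images[0]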
If you're just generating something for fun then DallE/MJ is probably sufficient, but if you're doing a project that requires specific details/style/consistency you're going to need way more tools. With SD/A1111 you can use a specific model (one that generates images in an anime style, for instance), use a ControlNet model for a specific pose, generate hundreds of candidate images (without having to pay for each one), hone your vision with tools like img2img/inpaint on the images you like, and if you're after a specific effect (a gif, for instance), use the many extensions created by the community to make it happen.
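To make the ControlNet piece concrete, here's roughly what locking a pose looks like with diffusers (model names are the common community checkpoints; A1111 exposes the same thing through its UI):

    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from diffusers.utils import load_image

    # An OpenPose skeleton extracted from a reference photo
    pose = load_image("pose.png")

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # swap in your anime-style model here
        controlnet=controlnet, torch_dtype=torch.float16,
    ).to("cuda")

    # A batch of candidates in one call, all locked to the same pose
    images = pipe(
        ["anime character, dynamic lighting, detailed"] * 4,
        image=pose,
        negative_prompt=["lowres, bad anatomy"] * 4,
    ).images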
Then this: https://civitai.com/
And I have completely abandoned DALLE and will likely never use it again.
But it clearly didn't win in many scenarios, especially those that require precise text, which happens to matter more in commercial settings; cleaning up the gibberish text that OSS Stable Diffusion generates is tiring by itself.
It installs dozens upon dozens of models and related scripts painlessly.
Also not sure if it can be extended with LoRAs, or turned into a video/3D model, the same way an LDM can.
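With Stable Diffusion, at least, bolting on a community LoRA is a one-liner in diffusers (the checkpoint name below is a placeholder for any LoRA you'd grab from e.g. civitai):

    # Assumes `pipe` is a StableDiffusionPipeline as in the earlier sketches
    pipe.load_lora_weights("some-user/some-style-lora")  # placeholder repo id
    image = pipe("your prompt, in the LoRA's style").images[0]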
I'm one of the founders.