This makes the image much more usable without editing.
(DALL-E pretends to do that, but it's actually just using GPT-4 Vision to create a description of the image and then prompting based on that.)
Live editing tools like https://drawfast.tldraw.com/ are increasingly being built on top of Stable Diffusion, and are far and away the most interesting way to interact with image generation models. You can't build that on DALL-E 3.
I guess that turns out not to be as important for end users as you'd think.
Anyway, DeepFloyd/IF has great comprehension. It would be straightforward to improve that for Stable Diffusion; I can't tell you exactly why they haven't tried it.
If you're just generating something for fun then DALL-E/MJ is probably sufficient, but if you're doing a project that requires specific details, style, or consistency, you're going to need far more tools. With SD/A1111 you can use a specific model (one that generates images in an anime style, for instance), use a ControlNet model for a specific pose, and generate hundreds of candidate images without paying per image. You can then hone your vision with tools like img2img and inpainting on the images you like, and if you're after a specific effect (a GIF, for instance), there are plenty of community extensions to make it happen.
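As a rough illustration of the ControlNet part of that workflow, here's a minimal sketch using the Hugging Face diffusers library rather than the A1111 UI (the model IDs, pose image path, and prompt are just example assumptions, not anything from the comment above):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose-conditioned ControlNet plus an SD 1.5 base checkpoint (example model IDs).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# An OpenPose skeleton image that pins down the character's pose (placeholder file).
pose = load_image("pose_reference.png")

# Generate several candidates locally; no per-image API cost.
images = pipe(
    "anime-style knight, detailed armor, studio lighting",
    image=pose,
    num_inference_steps=25,
    num_images_per_prompt=4,
).images

for i, img in enumerate(images):
    img.save(f"candidate_{i}.png")
```

Swap the base checkpoint for an anime-tuned one and you get the style control the comment is describing, with the pose held fixed across every candidate.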
But it clearly didn't win in many scenarios, especially those that require text to be precise, which happens to matter more in commercial settings; cleaning up the gibberish text generated by OSS Stable Diffusion is tiring in itself.
Also not sure if it can be extended with LoRAs or turned into a video/3D model the same way an LDM can.
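For what it's worth, on the LDM side "extended with LoRAs" usually just means loading extra fine-tuned weights on top of the base pipeline. A minimal diffusers sketch, assuming a hypothetical local LoRA file:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Apply a community-trained style LoRA on top of the base checkpoint
# (the file path here is hypothetical).
pipe.load_lora_weights("./loras/watercolor_style.safetensors")

image = pipe("a lighthouse at dusk, watercolor style", num_inference_steps=30).images[0]
image.save("watercolor_lighthouse.png")
```

Nothing comparable is exposed for DALL-E 3, which is the point: you can't attach your own weights to a closed API.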