I guess that turns out not to be as important for end users as you'd think.
Anyway, DeepFloyd/IF has great comprehension. It would be straightforward to improve that for Stable Diffusion; I cannot tell you exactly why they haven't tried this.
Also, I'm not sure whether it can be extended with LoRAs or turned into a video/3D model the same way an LDM can.