yeah, Stable Diffusion has a very limited understanding of composition instructions. you can reliably get things drawn, but it's super hard to get a specific thing in a specific place (e.g. "a man with blond hair near a girl with black hair" is gonna assign hair color more or less randomly, and there's no guarantee on how many people will be in the picture). regional prompting and ControlNet help somewhat, but regional prompting is very unreliable and ControlNet is, well, not pure text-to-image, since it needs a conditioning image on top of the prompt.
DALL-E 3 gets this kind of thing right most of the time.