In Photoshop I can select exactly where I want changes and remove specific elements. If I submit the image and try to describe the changes I want in text, the output is much harder to control. (And I might still get scrambled text, for instance, in parts of the image that it didn't even need to touch.)
I think this sort of task-specific specialization will have a long future. It's hard to imagine pure text once again becoming the dominant way of transferring information for 90% of the things we do with computers, after 40 years of building specialized non-text interfaces.
I was a bit surprised that it still produced gibberish text on posters in the background, in a part of the image that should have been unaffected and at first glance hadn't changed at all. So even just the "masking" ability of a GUI, something like "anything outside of this range should not be touched", would be a godsend.
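To make the masking idea concrete, here's a rough sketch of what I mean. This is not how any particular tool does it; `composite_with_mask` and the tiny grids of numbers standing in for pixels are made up for illustration. The point is only that whatever the model generates, pixels outside the selection get copied straight from the original:

```python
def composite_with_mask(original, edited, mask):
    """Return a new image that takes `edited` pixels only where `mask` is True.

    All three arguments are hypothetical stand-ins: lists of rows of equal
    shape, where `original` and `edited` hold pixel values and `mask` holds
    booleans (True = "this pixel may change").
    """
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("original, edited and mask must have the same height")

    result = []
    for orig_row, edit_row, mask_row in zip(original, edited, mask):
        if not (len(orig_row) == len(edit_row) == len(mask_row)):
            raise ValueError("original, edited and mask must have the same width")
        # Outside the mask the original pixel is kept, so nothing there can be scrambled.
        result.append([e if m else o for o, e, m in zip(orig_row, edit_row, mask_row)])
    return result


# Toy example: the "model" changed every pixel, but only one was selected.
original = [[1, 2, 3], [4, 5, 6]]
edited = [[9, 9, 9], [9, 9, 9]]
mask = [[False, True, False], [False, False, False]]

print(composite_with_mask(original, edited, mask))
```

With something like this as a final step, the posters in the background would come out byte-for-byte identical no matter what the generator did to them.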