This is particularly true as you approach the point where the image quality itself is superb and people increasingly turn to attacking the semantics & control of the prompt to degrade the quality ("...The donkey is holding a rope on one end, the octopus is holding onto the other. The donkey holds the rope in its mouth. A cat is jumping over the rope..."). For that sort of thing, it's hard to see how simply beefing up the raw pixel-generating part will help much: if the input seed is wrong and doesn't encode a correct thumbnail sketch of how all these animals ought to be engaging in outdoor sports, there's nothing some low-level pixel-munging neurons can do to fix it.