I just got the model to generate a spider without a leg by prompting "Spider missing one leg", and it did it fine. It won't do it every time (in my case, 1 out of 2 attempts), but it will do it. I used the gpt-image-1 model through the API. I don't think they are actually running a full end-to-end text/image sequence model. I don't think anyone really is commercially; they are hybrids as far as I know. Someone here probably has better information on the current architectures.
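
For anyone who wants to try reproducing this, here is a rough sketch of the kind of call I mean, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment. The helper names (`build_request`, `generate_images`), the output filenames, and the image size are just illustrative choices, not anything special from the API. Requesting two images per call makes it easy to see the hit-or-miss behavior.

```python
import base64
import os

PROMPT = "Spider missing one leg"  # the prompt quoted above


def build_request(prompt: str = PROMPT, n: int = 2) -> dict:
    """Return keyword arguments for the image generation call (hypothetical helper)."""
    return {"model": "gpt-image-1", "prompt": prompt, "n": n, "size": "1024x1024"}


def generate_images(out_dir: str = ".", **overrides) -> list:
    """Request images and save them as PNG files. Needs OPENAI_API_KEY to be set."""
    # Imported lazily so the rest of the file loads even without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    response = client.images.generate(**build_request(**overrides))

    paths = []
    for i, item in enumerate(response.data):
        # gpt-image-1 returns each image as base64-encoded data.
        path = os.path.join(out_dir, f"spider_{i}.png")
        with open(path, "wb") as f:
            f.write(base64.b64decode(item.b64_json))
        paths.append(path)
    return paths
```

Calling `generate_images()` should write `spider_0.png` and `spider_1.png` to the current directory; you then have to count the legs yourself, and results will vary from run to run.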