zlacker

1. Xenoph+(OP) 2025-12-05 23:12:51
It mostly depends on how the models work. Unified multimodal text/image sequence-to-sequence models can do this pretty well; diffusion models can't.
replies(1): >>vunder+fn
2. vunder+fn 2025-12-06 02:42:33
>>Xenoph+(OP)
Multimodal certainly helps, but "pretty well" is a stretch. I'd be curious which multimodal model in particular you've tried that can consistently handle generative prompts of the above nature (without human-in-the-loop corrections).

For example, to my knowledge ChatGPT is unified, and I can guarantee it can't handle something like a 7-legged spider.

replies(1): >>Xenoph+U52
3. Xenoph+U52 2025-12-06 21:22:23
>>vunder+fn
I just got the model to generate a spider without a leg by saying "Spider missing one leg", and it did it fine. It won't do it every time (in my case 1 out of 2), but it will do it. I used the gpt-image-1 model via the API. I don't think they are actually running a full end-to-end text/image sequence model; I don't think anyone really is commercially. They are hybrids as far as I know. Someone here probably has better information on the current architectures.
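
For anyone who wants to try reproducing this, here's a minimal sketch of that kind of call using the OpenAI Python SDK. It assumes OPENAI_API_KEY is set in the environment; the n=2 and the output filenames are just illustrative, not exactly what I ran:

    import base64
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    # Ask for two candidates, since for me it only succeeded about 1 in 2.
    result = client.images.generate(
        model="gpt-image-1",
        prompt="Spider missing one leg",
        n=2,
        size="1024x1024",
    )

    # gpt-image-1 returns base64-encoded images; decode and save each one.
    for i, image in enumerate(result.data):
        with open(f"spider_{i}.png", "wb") as f:
            f.write(base64.b64decode(image.b64_json))

Then just eyeball the outputs and count legs.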