zlacker

Looks good, but 24GB of VRAM is quite a lot for 1024x1024 images.

>>Mashim+(OP)
This is a pixel-space diffusion model that doesn't use latent space encoding, hence the memory requirements. Besides, good prompt understanding requires large transformers for text encoding, usually far larger than the image generation part. DeepFloyd IF (DF IF) uses T5, for example.
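
To put a rough number on the pixel vs. latent point, here's a back-of-the-envelope sketch comparing how many values the denoiser has to work on per image. The 8x downsampling and 4 latent channels are assumptions borrowed from the Stable Diffusion family of VAEs, not from the model being discussed, and `tensor_elements` is just a made-up helper name.

```python
def tensor_elements(height, width, channels):
    """Number of values in a single (height, width, channels) tensor."""
    return height * width * channels

# Pixel space: the model works directly on a 1024x1024 RGB image.
pixel_elems = tensor_elements(1024, 1024, 3)

# Latent space (assumed SD-style VAE: 8x spatial downsampling, 4 channels).
latent_elems = tensor_elements(1024 // 8, 1024 // 8, 4)

ratio = pixel_elems / latent_elems
print(pixel_elems, latent_elems, ratio)  # 3145728 65536 48.0
```

This only counts input values, so it's not a VRAM estimate: activations, attention, and the text encoder weights all add to it, and pixel-space pipelines may cascade through lower resolutions first. It just shows why skipping the latent encoding gets expensive fast.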

You can use Harrlogos XL to produce text with SDXL, although it's mostly limited to short captions and logos. The other way (ControlNets) is more involved, and is actually useful.