zlacker

This is a pixel diffusion model that doesn't use latent space encoding, hence the memory requirements. Besides, good prompt understanding requires large transformers for text encoding, usually far larger than the image generation part. DF IF is using T5.

You can use Harrlogos XL to produce text with SDXL, although it's mostly limited to short captions and logos. The other way (controlnets) is more involved. (and is actually useful)