It doesn't output it outright; it forms it gradually, finding and strengthening progressively finer-grained features amid the dwindling noise, combining the learned associations between memorized convolutional texture primitives and encoded text embeddings. In the limit of enough data, those associations and primitives turn out to be composable enough to cover out-of-distribution benchmark scenes.
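For concreteness, a toy sketch of that refinement loop, with a hypothetical denoiser standing in for the trained noise-prediction network and a standard DDPM-style schedule (nothing here is the actual model):

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)          # noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    def denoiser(x_t, t, text_emb):
        # Stand-in for a trained epsilon-predictor conditioned on text_emb;
        # a real model returns its estimate of the noise present in x_t.
        return torch.randn_like(x_t)

    @torch.no_grad()
    def sample(shape, text_emb):
        x = torch.randn(shape)                     # start from pure noise
        for t in reversed(range(T)):
            eps = denoiser(x, t, text_emb)         # predicted noise at step t
            # Subtract the predicted noise: coarse structure firms up first,
            # finer-grained features as t -> 0.
            x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x

    img = sample((1, 3, 64, 64), text_emb=torch.randn(1, 512))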
When you have a high-quality encoder of your modality into a compressed vector representation, the rest is optimization over a sufficiently high-dimensional, plastic computational substrate (model): https://moultano.wordpress.com/2020/10/18/why-deep-learning-...
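A minimal sketch of that recipe, under assumed shapes and a dummy task (the frozen linear "encoder" here is just a stand-in for whatever pretrained encoder you trust):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # stand-in for a good pretrained encoder
    for p in encoder.parameters():
        p.requires_grad_(False)                    # treat the representation as fixed

    model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 10))  # the plastic substrate
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    x, y = torch.randn(32, 3, 64, 64), torch.randint(0, 10, (32,))       # dummy batch
    z = encoder(x)                                 # compress the modality into vectors
    loss = F.cross_entropy(model(z), y)            # any differentiable objective; the rest is optimization
    loss.backward()
    opt.step()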
It works because it should. The next question is: "What are the implications?"
Can we meaningfully represent every available modality in a single latent space, and freely interconvert composable gestalts like this https://files.catbox.moe/rmy40q.jpg ?
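One concrete reading of "single latent space" is CLIP-style contrastive alignment: per-modality encoders (hypothetical toy stand-ins below) trained so that paired inputs land near each other. That gives you the retrieval half of free interconversion; generating back out of the space still needs decoders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    image_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy image encoder
    text_enc = nn.Linear(300, 128)                 # toy text encoder (e.g. pooled word vectors)

    def clip_loss(img_z, txt_z, temperature=0.07):
        img_z, txt_z = F.normalize(img_z, dim=-1), F.normalize(txt_z, dim=-1)
        logits = img_z @ txt_z.t() / temperature   # pairwise similarity matrix
        targets = torch.arange(logits.size(0))     # i-th image pairs with i-th caption
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    imgs, caps = torch.randn(8, 3, 32, 32), torch.randn(8, 300)
    loss = clip_loss(image_enc(imgs), text_enc(caps))
    loss.backward()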