Reproducing parts of existing images in the dataset is called overfitting and is considered a failure of the model.
I wrote an OCR program in college. We split the dataset in half: you train it on one half, then test it against the other half.
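For illustration, a minimal sketch of that holdout split, assuming scikit-learn is available (train_test_split is a real function; the data here is a stand-in for real images and labels):

    from sklearn.model_selection import train_test_split

    # Stand-in data: feature vectors and their ground-truth characters.
    images = [[0.1], [0.2], [0.3], [0.4]]
    labels = ["a", "b", "c", "d"]

    # Train on one half, hold the other half out for testing.
    train_x, test_x, train_y, test_y = train_test_split(
        images, labels, test_size=0.5, random_state=0
    )
    # model.fit(train_x, train_y)
    # accuracy = model.score(test_x, test_y)  # evaluated only on unseen data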
You can train Stable Diffusion on half the images, but then what? You use the image descriptions of the other half and measure how similar the outputs are? In essence, that is attempting to reproduce exact replicas. But I guess even then it wouldn't be copyright infringement if those images weren't used in the model. It's more like me describing something vividly to you, asking you to paint it, and then getting angry at you because it's too accurate.
The guidance component is a vector representation of the text (an embedding) that changes where we are in the sample space. A change of position in the sample space changes the likelihood, so for different prompts the likelihood of the same output image given the same input image will be different.
Since the model is trained to maximise the ELBO (evidence lower bound), it will produce a change that moves the sample closer to the prompt.
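A minimal sketch of how that text vector steers a single denoising step, assuming a trained conditional denoiser eps_model(x, t, emb) and precomputed prompt embeddings (all names here are placeholders); the weighted combination is the classifier-free guidance rule that Stable Diffusion uses:

    import torch

    def guided_noise_estimate(eps_model, x, t, prompt_emb, null_emb, guidance_scale=7.5):
        # Predict the noise with and without the text conditioning.
        eps_cond = eps_model(x, t, prompt_emb)
        eps_uncond = eps_model(x, t, null_emb)
        # Classifier-free guidance: push the estimate toward the prompt,
        # shifting where in the sample space the next step lands.
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)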
A good way to think about it is this: given a classifier, I can select a target class, compute the gradient of that class's score with respect to the input, and apply the gradient to the input. This moves the input closer to my target class.
From the perspective of some models (score models), they directly produce the gradient of the (log-)density of the samples, so it's quite similar to computing that gradient via a classifier.
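A minimal PyTorch sketch of that update, with a hypothetical pretrained classifier: take the gradient of the target class's logit with respect to the input pixels and nudge the image along it.

    import torch

    def step_toward_class(classifier, image, target_class, step_size=0.01):
        image = image.clone().requires_grad_(True)
        logits = classifier(image)          # shape: (1, num_classes)
        logits[0, target_class].backward()  # d(target logit) / d(input pixels)
        with torch.no_grad():
            # Gradient ascent on the input: move it toward the target class.
            return image + step_size * image.grad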
The above was concerned with what the NN was doing.
The sampling algorithm applies the operator for a number of steps and progressively improves the image. In some probabilistic models, you can think of this as stochastic gradient descent run in reverse: a series of steps that, with some stochasticity, climb toward a high-density region.
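A minimal sketch of such a loop (Langevin dynamics), assuming score(x) approximates the gradient of the log-density; this illustrates the idea rather than Stable Diffusion's actual sampler:

    import torch

    def langevin_sample(score, x, n_steps=1000, step_size=1e-4):
        # Noisy gradient ascent on the log-density: each step climbs toward
        # high-density regions, with injected noise providing the stochasticity.
        for _ in range(n_steps):
            noise = torch.randn_like(x)
            x = x + step_size * score(x) + (2 * step_size) ** 0.5 * noise
        return x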
However, it turns out that learning this operation doesn’t have to be grounded in probability theory and graphical models.
As long as the NN learns a sufficiently good recovery operator, diffusion will construct something based on the properties of the dataset it was trained on.
At no point, however, are there condensed representations of images, since the NN is not learning to produce an image from scratch in one step. It merely learns to undo some degradation applied to its input.
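A minimal sketch of that training objective, assuming a degradation function degrade(x, t) (e.g. mixing in Gaussian noise) and a network restore_net, both placeholders; the network only ever learns to undo the degradation, not to emit stored images:

    import torch
    import torch.nn.functional as F

    def training_step(restore_net, degrade, x_clean, optimizer):
        t = torch.randint(0, 1000, (x_clean.shape[0],))  # random degradation level
        x_degraded = degrade(x_clean, t)                 # corrupt the real image
        x_recovered = restore_net(x_degraded, t)
        # Learn only the recovery operator: map the degraded input
        # back to the original.
        loss = F.mse_loss(x_recovered, x_clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()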
For the probabilistic view, read Denoising Diffusion Probabilistic Models and its references, in particular on Langevin dynamics. It includes citations to score models as well.
For the non-probabilistic view, read Cold Diffusion.
For using the classifier gradient to update an image towards another class, read about adversarial generation via input gradients.
Instead of aiming to reproduce exact replicas, you run a classifier and retrieve the inputs to its last layer. Do this for both generated and original images, then measure the difference in the statistics of the two feature sets; this is the idea behind the Fréchet Inception Distance (FID).
Wikipedia has a good article on this.
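A minimal sketch of that comparison with NumPy and SciPy, assuming real_feats and fake_feats are arrays of penultimate-layer classifier features for the two image sets:

    import numpy as np
    from scipy import linalg

    def frechet_distance(real_feats, fake_feats):
        # Fit a Gaussian to each set of classifier features...
        mu_r, cov_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
        mu_f, cov_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
        # ...and measure the Frechet distance between the two Gaussians.
        covmean = linalg.sqrtm(cov_r @ cov_f).real
        return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)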
excellent description, thanks
They don't even produce the same image twice given the same description and a different random seed.