zlacker

“Having copied the five billion images—without the consent of the original artists—Stable Diffusion relies on a mathematical process called diffusion to store compressed copies of these training images, which in turn are recombined to derive other images.”

This seems like it’s not an accurate description of what diffusion is doing. A diffusion model is not the same as compression. They’re implying that Stable Diffusion is taking the entire dataset and making it smaller then storing it. Instead, it’s just learning patterns about the art and replicating those patterns.

The “compression” they’re referring to is the latent space representation which is how Stable Diffusion avoids having to manipulate large images during computation. I mean you could call it a form of compression, but the actual training images aren’t stored using that latent space in the final model afaik. So it's not compressing every single image and storing it in the model.

This page says there were 5 billion images in the stable diffusion training dataset (albeit that may not be true as I see online it’s closer to the 2 billion mark). A Stable Diffusion model is about 5 gb. 5 gb / 5 billion is 1 byte per image. That’s impossible to fit an image in 1 byte. Obviously the claim about it storing compressed copies of the training data is not true. The size of the file comes from the weights in it, not because it’s storing “compressed copies”. In general, it seems this lawsuit is misrepresenting how Stable Diffusion works on a technical level.

replies(2): >>jeroen+1R >>Ephil0+rp1

>>Ephil0+(OP)
If you can put a bunch of large things together into a small file and then later (lossily) extract the large thing out of that smaller file, I'd argue that's compression, yeah. It doesn't really matter if it was intended to be art up as a compression algorithm or not in my opinion. If anything, this approach can be considered a revolution in lossy image compression, even though there's no real market for that at the moment.

If someone finds a way to reverse a hash, I'd also argue that hashing has now become a form of compression.

I think in 5 billion images there are more than enough common image areas to allow for average compression to become lower than a single byte. This is a lossy process, it does not need a complete copy of the source data, similar to how an MP3 doesn't contain most of the audio data fed into it.

I think the argument that SD revolves around lossless compression is quite an interesting one, even if the original code authors didn't realise that's what they were doing. It's the first good technical argument I've heard, at least.

All of those could've been prevented if the model was trained on public domain images instead of random people's copyrighted work. Even if this lawsuit succeeds, I don't think image generation algorithms will be banned. Some AI companies will just have spent a shitton of cash failing to get away with copyright violation, but the technology can still work for art that's either unlicensed or licensed in such a way that AI models can be trained based on it.

replies(2): >>Ephil0+F31 >>angust+9u2

>>jeroen+1R
I still would not call the diffusion process a form of compression. The reason why is because as a whole these models don’t aim to exactly replicate their dataset. If they did, that’s considered overfitting which is a failure of the model (as another commenter said). Generally, these models can almost never be coaxed to give their original data back. To really be considered a form of compression, you’d have to make it easier to do that. Technically, you can do it (e.g. describing a very specific scene in a very specific style), but at that point you’re basically just giving detailed instructions on what to do. If I told a human to paint a very picture and gave them extremely specific steps, that would not be considered compression. That would just be them knowing how existing art patterns work and using that knowledge to follow my instructions. In general, I don’t think it should be considered compression because the results are almost always novel and it’s extremely hard to get anything even close to the original dataset.

>>Ephil0+(OP)
I should clarify a bit about how latent space works as I didn't in the original comment.

Stable diffusion has something called an encoder and decoder. What the encoder does is it takes an image, finds it's fundamental characteristics, and then converts it into a data point (for the sake of simplicity we will use a vector even though it doesn't have to be). Let's say the vector <0.2,0.5,0.6> represents a black dog. If you took a similar vector, you would get another picture of a dog (say a white dog). These vectors are contained in what's called a latent space which is just a collection of items where similar concepts are close together.

Stable Diffusion uses this latent space because it's more computationally efficient. So what it does is it starts with a noisy image which is converted into latent space, then it slowly gets rid of noise. It does this entire process on the latent space representation as opposed to the actual image. This means it's more computationally efficient because it doesn't have to store an entire pixel image in memory. Once it finishes getting rid of the noise, it uses the decoder to convert the image back into a pixel image. What you'll notice is that throughout this entire process it's not just retrieving a compressed image from it's training set and then using it. Instead, it's generating the image through de-noising. This de-noising process is guided by it's understanding of different concepts that can be represented in the latent space.

I think where this lawsuit goes wrong is it implies that the latent space is literally storing a copy of every image in the dataset. As far as I am aware, this is not true. Even though the latent space representations of images are dramatically smaller, it's not small enough to fit the entire dataset in a 5gb file. The only thing Stable Diffusion is storing is the algorithm itself for converting to and from latent space and that's just for computational efficiency as mentioned above. I've heard that Stable Diffusion might store some key concepts from the latent space, but I don't know if that's true or not. Either way, it seems unlikely that the entire dataset is being stored in Stable Diffusion. To me, it seems that saying Stable Diffusion is storing the images themselves is like saying GZIP's algorithm is storing the compressed version of every file in existence.

Disclaimer: Not an ML expert and this is just based on my own understanding of how it works. So I could be wrong

replies(1): >>Ephil0+0X1

>>Ephil0+rp1
Update: As I’ve looked more into the topic the less sure I am now about if I’m correct. I still think that there’s probably little chance that the whole dataset is shipped with stable diffusion. However, I am wondering about if maybe partial examples are shipped with it (e.g. a dictionary of certain concepts) or if there any any other caveats where stable diffusion might contain traces of the original data (note: I don’t think it contains the whole dataset still). I am not an expert so there’s a chance I could be wrong about all of this. So take my words with a grain of salt. Regardless, I still don’t believe the characterization of stable diffusion just copying and pasting images is correct and I believe the lawsuit still is making several factual errors as others online have pointed out.

replies(1): >>Ephil0+no7

>>jeroen+1R
There is a strong, well-understood connection between deep-latent variable models (e.g. VAEs, diffusion models), and compression.

Many state-of-the-art compression algorithms are in fact based on generative models. But the thing is, the model weights themselves are not the compressed representation.

The trained model is the compression algorithm (or more technically, a component of it... as it needs to be combined with some kind of entropy coding).

You could use Stable Diffusion to compress and store the training data if you wanted, but nobody is doing that.

>>Ephil0+0X1
Nevermind, asked someone on a Discord who is more familiar with ML than me. Also, checked a few resources online. As far as I can tell there aren't any traces of the original dataset in Stable Diffusion. There aren't even partial examples of the dataset in there according to the person I talked to. Maybe they're wrong, but I suspect they are right. I did read there is a dictionary for CLIP, but that's a bunch of words that Stable Diffusion can recognize and not saved artwork.

Disclaimer: Not an ML expert