This seems like an inaccurate description of what diffusion is doing. A diffusion model is not the same thing as compression. They're implying that Stable Diffusion takes the entire dataset, shrinks it, and stores it. In reality, it's just learning patterns from the art and replicating those patterns.
The “compression” they’re referring to is the latent space representation, which is how Stable Diffusion avoids manipulating full-size images during computation. You could call that a form of compression, but afaik the actual training images aren't stored in that latent space in the final model. So it's not compressing every single image and storing it in the model.
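For scale, here's a quick sketch of the sizes involved, using SD v1-style numbers (the VAE downsamples each side by 8x and uses 4 latent channels; exact factors vary by model version, so treat these as illustrative assumptions):

```python
# Rough size comparison: pixel-space image vs. its latent representation.
# Numbers assume an SD v1-style setup (8x per-side downsampling, 4 latent channels).
H, W, C = 512, 512, 3          # pixel-space image dimensions
f, latent_c = 8, 4             # assumed VAE downsampling factor and latent channels

pixel_values = H * W * C                        # values in the raw image
latent_values = (H // f) * (W // f) * latent_c  # values in the latent tensor

print(f"pixel values:  {pixel_values:,}")    # 786,432
print(f"latent values: {latent_values:,}")   # 16,384
print(f"ratio: {pixel_values / latent_values:.0f}x fewer values in latent space")
```

The point is that this shrinkage happens per-image *during* training and generation to make the math tractable; the latents themselves are thrown away, not stored in the released checkpoint.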
This page says there were 5 billion images in the Stable Diffusion training dataset (though that may not be accurate; from what I see online it's closer to 2 billion). A Stable Diffusion model is about 5 GB. 5 GB / 5 billion is 1 byte per image, and it's impossible to fit an image in 1 byte. Obviously the claim that it stores compressed copies of the training data is not true. The size of the file comes from the weights in it, not from storing “compressed copies”. In general, it seems this lawsuit is misrepresenting how Stable Diffusion works on a technical level.
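The back-of-envelope arithmetic above can be checked directly (using the comment's own figures of a ~5 GB checkpoint and 5 billion or ~2 billion training images):

```python
# Back-of-envelope check of the "compressed copies" claim.
model_size_bytes = 5 * 10**9             # ~5 GB checkpoint

# Dataset sizes from the comment: 5 billion (claimed) vs ~2 billion (reported)
for n_images in (5 * 10**9, 2 * 10**9):
    bytes_per_image = model_size_bytes / n_images
    print(f"{n_images:,} images -> {bytes_per_image:.1f} bytes per image")
# 5,000,000,000 images -> 1.0 bytes per image
# 2,000,000,000 images -> 2.5 bytes per image
```

Even with the smaller dataset estimate, there's only a couple of bytes of model per training image, orders of magnitude below what any per-image compressed copy would require.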
If someone finds a way to reverse a hash, I'd also argue that hashing has now become a form of compression.
I think in 5 billion images there are more than enough common image areas for the average per-image cost to drop below a single byte. This is a lossy process; it doesn't need a complete copy of the source data, similar to how an MP3 doesn't contain most of the audio data fed into it.
I think the argument that SD revolves around lossy compression is quite an interesting one, even if the original code authors didn't realise that's what they were doing. It's the first good technical argument I've heard, at least.
All of those could've been prevented if the model was trained on public domain images instead of random people's copyrighted work. Even if this lawsuit succeeds, I don't think image generation algorithms will be banned. Some AI companies will just have spent a shitton of cash failing to get away with copyright violation, but the technology can still work for art that's either unlicensed or licensed in such a way that AI models can be trained based on it.