Interesting that they mention collages. IANAL, but my impression was that collages are derivative works if they incorporate many different pieces and only small parts of each original. Their compression argument seems more convincing.
You run into the pigeonhole argument. That level of compression could only work if there were fewer than seventy thousand different images in existence, total.
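Quick back-of-the-envelope check of where that seventy-thousand figure comes from (just arithmetic on the "two bytes per image" number quoted in this thread):

```python
# Pigeonhole: a 2-byte code can distinguish at most 2**16 images, so
# "two bytes per image" could only be a lossless per-image encoding
# if there were fewer than ~70,000 distinct images in total.
bits_per_image = 2 * 8
print(2 ** bits_per_image)  # 65536
```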
Certainly there’s a deep theoretical equivalence between intelligence and compression, but this scenario isn’t what anyone normally means by “compression”.
Just like gzip, training Stable Diffusion certainly removes a lot of data, but without understanding the effect of that transformation on the entropy of the data, it's meaningless to say things like "two bytes per image", because (like gzip) you need the whole encoded dataset to recover any single image.
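Here's a toy sketch of that gzip point (stand-in byte strings, not real images, and not how Stable Diffusion stores anything): when the items in a set share structure, dividing the joint stream's size by the item count gives a tiny "per image" number, even though no single image can be pulled out of the stream on its own.

```python
import gzip
import os

# 50 fake "images" that share one large common chunk plus a small unique tail.
base = os.urandom(20_000)
images = [base + os.urandom(100) for _ in range(50)]

# One stream for the whole dataset: the shared chunk is only paid for (roughly) once.
joint = gzip.compress(b"".join(images))

# Each image compressed on its own: the shared chunk is paid for every single time.
separate = sum(len(gzip.compress(im)) for im in images)

print(len(joint) // len(images))   # small "bytes per image" figure for the joint stream
print(separate // len(images))     # roughly the full ~20 KB per image when standalone
```

The exact numbers don't matter; the point is that "total size divided by number of items" tells you nothing about whether any individual item is recoverable without the rest of the encoded dataset.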
It's compressing many images into 10GB of data, not a single image into two bytes. This is directly analogous to what people usually mean by "compression".