zlacker

The difference is an image generation algorithm does not consume images the way a human does, nor reproduce them that way. If you show a human several Rembrandt's and ask them to duplicate them, you won't get exact copies, no matter how brilliant the human is: the human doesn't know how Rembrandt painted, and especially if you don't permit them to keep references, you won't get the exact painting: you'll get the elements of the original that most stuck out to them, combined with an ethereal but detectable sense of their original tastes leaking through. That's how inspiration works.

If on the other hand you ask an image generator for a Rembrandt, you'll get several usable images, and good odds a few them will be outright copies, and decent odds a few of them will be configured into an etsy or ebay product image despite you not asking for that. And the better the generator is, the better it's going to do at making really good Rembrandt style paintings, which ironically, increases the odds of it just copying a real one that appeared many times in it's training data.

People try and excuse this with explanations about how it doesn't store the images in it's model, which is true, it doesn't. However if you have a famous painting by any artist, or any work really, it's going to show up in the training data many, many times, and the more popular the artist, the more times it's going to be averaged. So if the same piece appears in lots and lots of places, it creates a "rut" in the data if you will, where the algorithm is likely going to strike repeatedly. This is why it's possible to get full copied artworks out of image generators with the right prompts.

replies(2): >>chii+G8 >>HanCli+O9

>>Toucan+(OP)
> with the right prompts.

that is doing a lot of pull. Just because you could "get the full copies" with the right prompts, doesn't mean the weights and the training is copyright infringement.

I could also get a full copy of any works out of the digits of pi.

The point i would like to emphasize is that the using data to train the model is not copyright infringement in and of itself. If you use the resulting model to output a copy of an existing work, then this act constitutes copyright infringement - in the exact same way that using photoshop to reproduce some works is.

What a lot of anti-ai arguments are trying to achieve is to make the act of training and model making the infringing act, and the claim is that the data is being copied while training is happening.

replies(1): >>DrScie+Kf

>>Toucan+(OP)
We have the problem of too-perfect-recall with humans too -- even beyond artists with (near) photographic memory, there's the more common case of things like reverse-engineering.

At times, developers on projects like WINE and ReactOS use "clean-room" reverse-engineering policies [0], where -- if Developer A reads a decompiled version of an undocumented routine in a Windows DLL (in order to figure out what it does), then they are now "contaminated" and not eligible to write the open-source replacement for this DLL, because we cannot trust them to not copy it verbatim (or enough to violate copyright).

So we need to introduce a barrier of safety, where Developer A then writes a plaintext translation of the code, describing and documenting its functionality in complete detail. They are then free to pass this to someone else (Developer B) who is now free to implement an open-source replacement for that function -- unburdened by any fear of copyright violation or contamination.

So your comment has me pondering -- what would the equivalent look like (mathematically) inside of an LLM? Is there a way to do clean-room reverse-engineering of images, text, videos, etc? Obviously one couldn't use clean-room training for _everything_ -- there must be a shared context of language at some point between the two Developers. But you have me wondering... could one build a system to train an LLM from copywritten content in a way that doesn't violate copyright?

[0]: https://en.wikipedia.org/wiki/Clean-room_design

>>chii+G8
>The point i would like to emphasize is that the using data to train the model is not copyright infringement in and of itself.

Interesting point - though the law can be strange in some cases - so for example in the UK in court cases where people are effectively being charged for looking at illegal images, the actual crime can be 'making illegal images' - simply because a precedence has been set that because any OS/Browser has to 'copy' the data of any image in order someone to be able to view it - any defendent has been deemed to copied it.

Here's an example. https://www.bbc.com/news/articles/cgm7dvv128ro

So to ingest something your training model ( view ) you have by definition have had to have copied it to your computer.

replies(1): >>xp84+yr3

>>DrScie+Kf
That seems to be an artifact of the whole copyright thing predating all forms of computing and memory, but if we don’t ignore that one, we’ve all been illegally copying copyrighted text, images and videos into our RAM every time we use the Internet. So i think the courts now basically acknowledge that that doesnt count as a “copy.”

*Not a lawyer

replies(1): >>DrScie+Qi6

>>xp84+yr3
Expect I've given you a concrete real counter example of where they do treat copying in memory as 'making a copy'.