a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he had to pay Puzo, even though the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves copying the work into GPU memory. That copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
Training is almost certainly fair use, so it is exactly a transitory copy in service of fair use. Aside from the brief “transitory copy” you mention, training is not copying; it’s making a minuscule algorithmic adjustment based on fleeting exposure to the data.
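To put that in concrete terms, here’s a minimal toy sketch in Python (a hypothetical single-weight “model”, nothing like a real LLM) of what one training step amounts to: the text is read once, a tiny numeric nudge to the weight is computed from it, and then the text is gone; only the adjusted number remains.

    import random

    # Toy "language model": predicts the next character code from the
    # current one via a single weight. Purely illustrative.
    w = random.uniform(-0.01, 0.01)

    def train_step(text, lr=1e-6):
        global w
        # "Fleeting exposure": walk the text once, compute a gradient of a
        # squared-error loss, and nudge the weight by a tiny amount.
        for prev, nxt in zip(text, text[1:]):
            x, y = ord(prev), ord(nxt)
            pred = w * x
            grad = 2 * (pred - y) * x   # d(loss)/dw
            w -= lr * grad              # the minuscule algorithmic adjustment
        # the text goes out of scope here; only the adjusted weight w remains

    train_step("Some sentence from the training corpus.")
    print(w)  # a single float: no copy of the original text survives in the model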
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software (entirely within the memory system of a licensed user) in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
Seems extremely transitory, and since the output cannot be copyrighted, it does no harm to any work it “trained” on.
I don't think you can copyright a plot or story in any country, can you?
If he had rewritten the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but it would be legally OK.
If Microsoft truly believes that a trained model's output doesn't violate copyright, then it should be forced to prove that by training on all of its internal source code, including Windows.
If it disgorges parts of NYT articles, how do we know it isn't just a common phrase, or that the article isn't reproduced verbatim on another, non-paywalled site?
I agree that if the whole content of their articles was used for training, then the NYT should get paid, but I'm not sure they specifically trained on "paid NYT articles" as a category, though I'm happy to be corrected.
I also think that companies and authors vastly overvalue the tiny fragments of their work in the huge pool of training data; there's a bit of a "main character" vibe going on.