zlacker

[return to "A federal judge sides with Anthropic in lawsuit over training AI on books"]
1. Nobody+fc[view] [source] 2025-06-24 17:29:23
>>moose4+(OP)
One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)
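
For concreteness, the kind of server-side filter the ruling credits could be as simple as blocking any completion that shares a long n-gram with an index built from the protected texts. A minimal sketch, purely illustrative; the ruling doesn't describe Anthropic's actual mechanism:

    # Illustrative server-side output filter: reject completions that
    # overlap a protected-text index on any long word n-gram.
    # Not a claim about how Anthropic's real filter works.
    def build_index(protected_texts, n=8):
        index = set()
        for text in protected_texts:
            words = text.split()
            for i in range(len(words) - n + 1):
                index.add(tuple(words[i:i + n]))
        return index

    def is_blocked(completion, index, n=8):
        words = completion.split()
        return any(tuple(words[i:i + n]) in index
                   for i in range(len(words) - n + 1))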

Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
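
To make "trivially" concrete: with open weights there is nothing sitting between the model and the user, so plain greedy decoding from a prompt lifted out of a book surfaces whatever the weights memorized. A minimal sketch with Hugging Face transformers; the checkpoint name is a placeholder, and whether anything copyrighted actually comes out depends on what that model memorized:

    # Sketch: probing an open-weights model for memorized continuations.
    # The checkpoint name is a placeholder for any locally runnable model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Prompt with a book's opening line; greedy decoding, no output filter.
    ids = tok("It was the best of times, it was the worst of times,",
              return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=100, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))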

[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

◧◩
2. ticula+hf[view] [source] 2025-06-24 17:46:41
>>Nobody+fc
Yep, broadly capable open models are on track for annihilation. Legally obtaining all of the training material will be costly enough to require hefty backing.

Additionally, if you download a model file that contains enough of the source material to be considered infringing (assume you can extract the contents directly out of the weights, without even running the LLM), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can be held accountable not for what they store but for what they produce.

◧◩◪
3. dragon+kL[view] [source] 2025-06-24 20:40:01
>>ticula+hf
> a model file that contains enough of the source material to be considered infringing

The amount of the source material encoded does not, alone, determine whether it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly believed "30-second rule" for music, for instance), but they are just that: myths.

◧◩◪◨
4. fallin+hS[view] [source] 2025-06-24 21:23:22
>>dragon+kL
But there is the issue of whether there are damages. If my LLM can reproduce 10 random paragraphs of a Harry Potter book, it's obvious that nobody is skipping a purchase of the book just because they can read those 10 paragraphs. So there will not be any damages to the publisher, and the lawsuit will be tossed. There is a threshold for how much of the work needs to be reproduced, and how closely, but it's a subjective standard, not some hard line like ">50% of the text".

◧◩◪◨⬒
5. dragon+cV[view] [source] 2025-06-24 21:44:11
>>fallin+hS
> But there is the issue of whether there are damages.

Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.

If training a general-use LLM on books isn't infringement (as this decision holds), then by definition there cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.

It might bear on whether it is easy for a third party to use the model for something that would be infringement on that third party's part, but that would be a problem for the people who use it that way, not for the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".
