zlacker

[return to "A federal judge sides with Anthropic in lawsuit over training AI on books"]
1. Nobody+fc[view] [source] 2025-06-24 17:29:23
>>moose4+(OP)
One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)

Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
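The worry can be sketched with a toy stand-in for an open-weights model (purely illustrative; a real transformer memorizes far more lossily, and every name below is made up): a character-level lookup "model" that has fully memorized its training text. Greedy decoding from a short prefix spits the source back verbatim, and once the file is on your own disk no server-side filter stands in the way.

```python
from collections import defaultdict

CONTEXT = 8  # characters of context, a crude stand-in for a context window

def train(text):
    """Build a next-character table keyed on the preceding CONTEXT chars."""
    table = defaultdict(list)
    for i in range(CONTEXT, len(text)):
        table[text[i - CONTEXT:i]].append(text[i])
    return table

def greedy_extract(table, prefix, max_chars=500):
    """Greedy decoding: repeatedly emit the most frequent continuation."""
    out = prefix
    for _ in range(max_chars):
        candidates = table.get(out[-CONTEXT:])
        if not candidates:
            break
        out += max(set(candidates), key=candidates.count)
    return out

text = "The quick brown fox jumps over the lazy dog while the sly cat naps."
weights = train(text)  # the "model file" someone might distribute
print(greedy_extract(weights, text[:CONTEXT]))  # reproduces `text` verbatim
```

A hosted model can bolt a filter onto the `print` step; with local weights, nothing forces the user to.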

[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

◧◩
2. ticula+hf[view] [source] 2025-06-24 17:46:41
>>Nobody+fc
Yep, broadly capable open models are on track for annihilation. Legally obtaining all the training material will cost so much that only players with hefty backing can manage it.

Additionally, if a model file contains enough of the source material to be considered infringing (assume the contents can be extracted directly from the weights, without even running the LLM), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can be held accountable not for what they store but for what they produce.

◧◩◪
3. bonobo+JN[view] [source] 2025-06-24 20:54:15
>>ticula+hf
This technology is a really bad way of storing, reproducing, and transmitting the books themselves. It's probabilistic and lossy. It may be possible to reproduce some paragraphs, but no reasonable person would expect to read The Da Vinci Code by prompting an LLM. Surely the marketed use cases and the observed real use by users have to make it clear that the intended and vastly overwhelming use of an LLM is transformative, "digestive" synthesis of many sources into a merged, abstracted, generalized system that can function in novel settings, answering never-before-seen prompts usefully and overwhelmingly without reproducing existing written works. It surely matters what the purpose of the thing is, both in intention and in observed practice. It's not a viable competing alternative to reading the actual book.
◧◩◪◨
4. munifi+bU[view] [source] 2025-06-24 21:36:01
>>bonobo+JN
The number of people who buy CliffsNotes versions of books to pass examinations while claiming to have read the actual book suggests you are way overestimating how "reasonable" many people are.
◧◩◪◨⬒
5. bonobo+hW[view] [source] 2025-06-24 21:53:35
>>munifi+bU
CliffsNotes is fair use. Would you argue otherwise? Wikipedia also publishes plot summaries without infringing.
◧◩◪◨⬒⬓
6. munifi+N31[view] [source] 2025-06-24 22:54:49
>>bonobo+hW
In your parent comment, you argued about what people would do in practice. Now you have shifted to what is or isn't legal to do.

I'm not a legal scholar, so I'm neither qualified nor inclined to argue about whether CliffsNotes is fair use. But I do care about how people behave, and I'm pretty sure that CliffsNotes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.

In the case of CliffsNotes, it probably matters less, because the authors of the 19th-century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to afford to spend the time thinking, writing, and sharing their expertise.

◧◩◪◨⬒⬓⬔
7. bonobo+m51[view] [source] 2025-06-24 23:07:48
>>munifi+N31
It surely matters whether people actually use the thing for copyright violations or not. Summaries aren't copyright violations in the first place, so that part is irrelevant. Long verbatim copies would be, but one would have to demonstrate that this use case is significant: convenient enough to be a viable alternative to obtaining the particular text chunk some other way, and so on.

----

> But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.

Alright, but now you're arguing for some new regulation, since that isn't a matter of copyright.

In that context, I'd observe that many academics already put their technical books online for free: machine learning, computer vision, robotics, and so on. I doubt it's hugely lucrative in the first place.

[go to top]