zlacker

[return to "A federal judge sides with Anthropic in lawsuit over training AI on books"]
1. Nobody+fc[view] [source] 2025-06-24 17:29:23
>>moose4+(OP)
One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)

Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?

[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
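
To make the point concrete, "running the model without output filtering" is just plain local generation. A minimal sketch, assuming a Hugging Face transformers checkpoint you can load; the model name, prompt, and expected continuation are placeholders (the passage used here is public-domain, just to illustrate the overlap check):

~~~python
# Hypothetical sketch: greedy-decode a continuation of a book passage from a
# local open-weights checkpoint, then check how much known text it reproduces
# verbatim. Nothing sits between the raw model output and the user.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "It was the best of times, it was the worst of times,"
known_continuation = "it was the age of wisdom, it was the age of foolishness,"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)  # greedy, no server-side filter
continuation = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

print(continuation)
print("verbatim overlap:", known_continuation.lower() in continuation.lower())
~~~

A hosted API can run exactly this comparison server-side and refuse to return the output; with the weights on your own machine there is nothing to stop you reading it.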

◧◩
2. ticula+hf[view] [source] 2025-06-24 17:46:41
>>Nobody+fc
Yep, broadly capable open models are on track for annihilation. Legally obtaining all of the training material will require hefty financial backing.

Additionally, if a downloaded model file contains enough of the source material to be considered infringing (even without running the LLM, assume you can extract the contents directly out of the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object. Closed models, by contrast, can be held accountable not for what they store but for what they produce.

◧◩◪
3. lcnPyl+1J[view] [source] 2025-06-24 20:24:44
>>ticula+hf
> broadly capable open models are on track for annihilation

I'm not so sure about this one. In particular, suppose it is found that models which can produce infringing material are themselves infringing material. Since newer models can be distilled from older ones, the older model can, in effect, produce a new infringing model. By that logic, all output from the older model would be infringing, because any of it can be used to make infringing material (the new model, distilled from the old).

I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.

◧◩◪◨
4. ijk+s51[view] [source] 2025-06-24 23:08:40
>>lcnPyl+1J
In theory, couldn't you distill a non-infringing model from an infringing one? Just prompt it for continuations and penalize it every time the output matches something in your dataset of copyrighted works.

You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
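
A cruder variant of that idea is to filter rather than penalize: sample continuations from the (possibly infringing) teacher and keep only the ones that don't overlap the copyrighted corpus as training pairs for the student. A rough sketch, where `teacher_generate` is a hypothetical stand-in for whatever sampling API you use, and the word n-gram check is just one simple way to flag near-verbatim reproduction:

~~~python
# Hypothetical sketch of filtering distillation data against a copyrighted
# corpus. teacher_generate() is a placeholder; the overlap test is a crude
# word n-gram check, not a legal standard.

def word_ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_corpus_index(copyrighted_texts, n=8):
    index = set()
    for text in copyrighted_texts:
        index |= word_ngrams(text, n)
    return index

def is_clean(sample, corpus_index, n=8):
    # Reject the sample if any length-n word window also appears in the corpus.
    return not (word_ngrams(sample, n) & corpus_index)

def teacher_generate(prompt):
    # Placeholder for sampling a continuation from the teacher model.
    raise NotImplementedError

def collect_distillation_data(prompts, copyrighted_texts, n=8):
    corpus_index = build_corpus_index(copyrighted_texts, n)
    kept = []
    for prompt in prompts:
        sample = teacher_generate(prompt)
        if is_clean(sample, corpus_index, n):
            kept.append((prompt, sample))  # train the student only on clean pairs
    return kept
~~~

Here flagged samples are simply discarded from the student's training set; a stricter pass could lower the n-gram length or use fuzzy matching instead of exact windows.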
