zlacker

[parent] [thread] 17 comments
1. fallin+(OP)[view] [source] 2023-12-27 14:18:35
What are they arguing here? AFAIK reading copyrighted works is not copyright infringement. Copying and selling them is, as the name would suggest, but OpenAI absolutely did not do that. Are they trying to say that LLM training is a special type of reading that should be considered infringement? Seems like a weak case to me.

edit: Would be very funny if OpenAI used an educational fair use defense

replies(6): >>NN88+l >>zozbot+y >>the-rc+B >>eigenk+X >>cowsup+g1 >>tsimio+r4
2. NN88+l[view] [source] 2023-12-27 14:20:26
>>fallin+(OP)
>AFAIK reading copyrighted works

I hope you don’t think that’s all that’s happening, right?

>LLM training is a special type of reading that should be considered infringement

OK, what turn of phrase would you prefer?

replies(2): >>fallin+U1 >>Ringz+r3
3. zozbot+y[view] [source] 2023-12-27 14:21:24
>>fallin+(OP)
The article mentions that ChatGPT will absolutely parrot back NYTimes article text verbatim. So yes, it's copyright infringement.
replies(1): >>lagnia+42
4. the-rc+B[view] [source] 2023-12-27 14:21:29
>>fallin+(OP)
If you read the complaint, you will see that, among other things, paragraphs and paragraphs of NYT articles are reproduced verbatim or almost verbatim.
replies(1): >>aurizo+13
5. eigenk+X[view] [source] 2023-12-27 14:23:38
>>fallin+(OP)
The second paragraph of the article is

> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

replies(1): >>fallin+33
6. cowsup+g1[view] [source] 2023-12-27 14:24:59
>>fallin+(OP)
> AFAIK reading copyrighted works is not copyright infringement. [...] Are they trying to say that LLM training is a special type of reading that should be considered infringement?

Nobody can argue that OpenAI was feeding the content to ChatGPT because ChatGPT was bored or was curious about current events. It was fed NYT's content so it would know how to reproduce similar content, for profit.

I think getting case law on the books as to what is and is not legal with LLMs was inevitable. If it wasn't NYT suing OpenAI, it would be another publisher, or another artist, whose work was used to "train" these systems.

replies(2): >>Baldbv+h3 >>laweij+J3
◧◩
7. fallin+U1[view] [source] [discussion] 2023-12-27 14:29:24
>>NN88+l
You could definitely argue that it's more than just reading since they made the model out of it. But the matrix of parameters generated by training is so fundamentally different than the input that it is certainly covered by the transformative use exception to copyright.
◧◩
8. lagnia+42[view] [source] [discussion] 2023-12-27 14:30:16
>>zozbot+y
Sections of this statement absolutely parrot back NYTimes article text verbatim, depending on how you look at it. What's the line? 3 sequential verbatim words? 5? 8?
replies(1): >>noitpm+c4
◧◩
9. aurizo+13[view] [source] [discussion] 2023-12-27 14:35:37
>>the-rc+B
Sounds like the infinite monkey typewriter thing, where the NYT sieves the OpenAI output for exact segments, probably after priming it to fatten the yield.
◧◩
10. fallin+33[view] [source] [discussion] 2023-12-27 14:35:40
>>eigenk+X
> closely summarizes it

Absolutely not copyright infringement

> mimics its expressive style

Absolutely not copyright infringement

> can generate output that recites Times content verbatim

This one seems the closest to infringement, but still doesn't seem like infringement. A printer has this capability too. If a user told ChatGPT to recite NYT content and then sold that content, that would be 100% infringement, but would probably be on the user, not the tool. e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.

> “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

This claim seems far-fetched, as the point of the NYT is to report the news. One thing that LLMs absolutely cannot do is report today's news. I can see no way that ChatGPT is a substitute for the NYT in a way that violates copyright.

replies(1): >>kevind+g6
◧◩
11. Baldbv+h3[view] [source] [discussion] 2023-12-27 14:37:04
>>cowsup+g1
> ... for profit.

for non-profit

replies(1): >>cowsup+Wt
◧◩
12. Ringz+r3[view] [source] [discussion] 2023-12-27 14:37:56
>>NN88+l
Of course, OpenAI doesn't just read. But does it simply reproduce content verbatim? When it does not reproduce NYT content verbatim, but instead processes it, tailors it to the situation, mixes it, 'charges' it with new 'information content', and adjusts it to the purpose of the inquirer, the New York Times will not have an easy case.

Because ultimately, all of our knowledge is based on the knowledge of others, which we remix, 'charge', and change after reading. I also think that the New York Times uses the content of others to create new content.

◧◩
13. laweij+J3[view] [source] [discussion] 2023-12-27 14:39:08
>>cowsup+g1
> It was fed NYT's content so it would know how to reproduce similar content, for profit.

Sounds like journalism school?

◧◩◪
14. noitpm+c4[view] [source] [discussion] 2023-12-27 14:42:25
>>lagnia+42
We'll find out won't we ;)

You have to imagine these limits are already fairly well known within the legal community... If you're accused of copying/republishing my published work, there will be some minimal threshold of similarity I would need to prove in order to seek damages.

15. tsimio+r4[view] [source] 2023-12-27 14:43:41
>>fallin+(OP)
It should be noted that many licenses include explicit exemptions to allow copying program data into RAM and into CPU registers. Whether that is truly necessary or not is at best debatable, but arguably training a model (especially one you then distribute or give access to) on copyrighted data is vastly different from regular copying into memory and should require explicit licensing.

The fact that the model can reproduce large chunks of the original text verbatim is proof positive that it contains copies of the original text encoded in its weights. If I wrote a program that crawled the NYT site, zipping the contents, and retrieved articles based on keyword searches and made them available online, would you not say I'm infringing their copyright?

◧◩◪
16. kevind+g6[view] [source] [discussion] 2023-12-27 14:53:30
>>fallin+33
I'm in agreement, but this line is not quite an accurate metaphor:

> e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.

If the printer manufacturer had a product that could take one sentence and print multiple pages completing a news article from that sentence, ...

◧◩◪
17. cowsup+Wt[view] [source] [discussion] 2023-12-27 17:04:39
>>Baldbv+h3
1. Non-profit != "not making a profit." A non-profit can still earn monetary profit, and many do.

2. The non-profit OpenAI, Inc. company is not to be confused with the for-profit OpenAI GP, LLC [0] that it controls. OpenAI was solely a non-profit from 2015 to 2019; the for-profit arm was created in 2019, prior to the launch of ChatGPT. Microsoft has a significant investment in the for-profit company, which is why they're included in this lawsuit.

[0] https://openai.com/our-structure

replies(1): >>Baldbv+dx
◧◩◪◨
18. Baldbv+dx[view] [source] [discussion] 2023-12-27 17:24:00
>>cowsup+Wt
I know all that. But who did the training?