This is also related to earlier studies showing that OpenAI's models have a bad habit of regurgitating training data verbatim. If your training data is protected IP you didn't secure the rights for, that's a really big problem. Hence this lawsuit. If successful, the floodgates will open.
Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?
That's a pretty weird question that never existed when copyright was defined.
In what sense are they claiming their generated contents as their own IP?
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
https://openai.com/policies/terms-of-use
> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
I think it makes sense to hold model makers responsible when their tools make infringement too easy, or possible to do accidentally. However, that is a far cry from requiring a license to do the training in the first place.
More importantly, every case is unique, so what has really emerged is a set of principles for what defines fair use, which will certainly guide this one.
I agree. You can even listen to the NYT Hard Fork podcast (which I recommend, btw: https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...), where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.
They asked her about the issue of copyrighted training data. Her response was:
""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.
So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """
Now for my take: proving that OpenAI trained on NYT articles is not sufficient, IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think is easy to show. It takes a lot of prompt engineering and luck to pull out any article verbatim; it's well established that LLMs screw up even well-known facts, let alone accurately reproduce training data.
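(To make concrete what "pulling out training data" even looks like, here's a minimal sketch of the kind of memorization probe researchers run, using the OpenAI Python client. The model name, article text, and prompt are all placeholder assumptions on my part, not anything from the lawsuit.)

    # Hypothetical memorization probe (all names/text are placeholders):
    # feed the model the opening of an article, ask it to continue, and
    # measure how close the output is to the real continuation.
    from difflib import SequenceMatcher

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    opening = "The first paragraph of some published article..."
    real_continuation = "The actual next paragraph, pasted in for comparison..."

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        temperature=0,  # deterministic-ish output makes comparison meaningful
        messages=[{
            "role": "user",
            "content": f"Continue this text exactly as written:\n\n{opening}",
        }],
    )

    generated = response.choices[0].message.content
    similarity = SequenceMatcher(None, generated, real_continuation).ratio()
    print(f"Similarity to the real continuation: {similarity:.2f}")
    # In practice this ratio is usually low -- hence "a lot of prompt
    # engineering and luck".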
Saying they don’t claim rights over their output while outputting large chunks verbatim is the old YouTube scheme of uploading a movie and saying “no copyright infringement intended”.
Of course, I’m not a lawyer, and I know that in the US sticking to precedents (which mention the “verbatim” thing) takes precedence over judging something based on the spirit of the law, but stranger things have happened.
As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.
If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different from our previous notions about why we wanted a "fair use exception" in the first place.
> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist, and any copyright violations would fall on whoever publishes ChatGPT's output.
But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.
That would be like me photocopying a book you wrote and then handing out copies while declaring that I’m assigning the rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content, and thus they can’t just change the ownership rights via their terms of service. It doesn’t work like that.
In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.
I definitely agree with the "far in uncharted territory" bit. As for "speculation being useless": we're all pretty much just analyzing, guessing, and shooting the shit here, so I'm not sure "usefulness" is the right barometer. That's why I'm looking forward to this case, and I also totally agree the assessment is flexible.
But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the relevant market is defined pretty broadly, e.g.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
From https://fairuse.stanford.edu/overview/fair-use/four-factors/
Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?
This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.
IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.
By your logic, Firefox is redistributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't randomly generating copyrighted content; it only does so when explicitly prompted by a user.
Of course, if the input I give to ChatGPT is "here is a piece from an NYT article, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT returns the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on the NYT's distribution rights for this content, since they are retrieving it from their own storage.
If they returned a link you could click, and the content were retrieved from the NYT, along with any other changes such as advertising, even if it were inside an iframe, it would be an entirely different matter.
The situations aren’t remotely similar, and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work; in the other, Word is taking keyboard input from the user, and Word itself isn’t producing anything.
> GPT is just a tool.
I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.
Which brings us back to the beginning.
> the user who’s asking it to produce copyrighted content.
ChatGPT was trained on copyrighted content. The fact that it CAN reproduce that copyrighted content, and the fact that it was trained on it, are what the argument is about.