This is also related to earlier studies showing that OpenAI's models have a bad habit of regurgitating training data verbatim. If your training data is protected IP you didn't secure the rights for, that's a really big problem. Hence this lawsuit. If successful, the floodgates will open.
Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?
That's a pretty weird question that never existed when copyright was defined.
In what sense are they claiming their generated contents as their own IP?
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
https://openai.com/policies/terms-of-use
> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
I think it makes sense to hold model makers responsible when their tools make infringement too easy, or possible to do accidentally. However, that is a far cry from requiring a license to do the training in the first place.
More importantly, every case is unique, so what has really emerged is a set of principles for what defines fair use, which will certainly guide this one.
I agree. You can even listen to the NYT Hard Fork podcast (which I recommend, btw: https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...), where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.
They asked her about the issue of copyrighted training data. Her response was:
""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.
So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """
Now for my take: proving that OpenAI trained on NYT articles is not sufficient, IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think is easy to show. It takes a lot of prompt engineering and luck to pull out any article verbatim; it's well established that LLMs screw up even well-known facts, let alone accurately reproduce training data.
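(To make concrete what "pulling out training data" even looks like, here's a minimal sketch of the kind of memorization probe researchers run, using the OpenAI Python client. The model name, article text, and prompt are all placeholder assumptions on my part, not anything from the lawsuit.)

    # Hypothetical memorization probe (all names/text are placeholders):
    # feed the model the opening of an article, ask it to continue, and
    # measure how close the output is to the real continuation.
    from difflib import SequenceMatcher

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    opening = "The first paragraph of some published article..."
    real_continuation = "The actual next paragraph, pasted in for comparison..."

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        temperature=0,  # deterministic-ish output makes comparison meaningful
        messages=[{
            "role": "user",
            "content": f"Continue this text exactly as written:\n\n{opening}",
        }],
    )

    generated = response.choices[0].message.content
    similarity = SequenceMatcher(None, generated, real_continuation).ratio()
    print(f"Similarity to the real continuation: {similarity:.2f}")
    # In practice this ratio is usually low -- hence "a lot of prompt
    # engineering and luck".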
Saying they don’t claim rights over their output while outputting large chunks verbatim is the old YouTube scheme of uploading a movie and saying “no copyright infringement intended”.
Of course, I’m not a lawyer, and I know that in the US sticking to precedents (which mention the “verbatim” thing) takes precedence over judging something based on the spirit of the law, but stranger things have happened.
As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.
If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different from our previous notions about why we wanted a "fair use exception" in the first place.
> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist, and any copyright violations would fall on whoever publishes ChatGPT's output.
But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.
That would be like me photocopying a book you wrote and then handing out copies while declaring that I’m assigning the rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content, and thus they can’t just change the ownership rights via their terms of service. It doesn’t work like that.
In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.
I definitely agree with the "far in uncharted territory" bit. As for "speculation being useless": we're all pretty much just analyzing, guessing, and shooting the shit here, so I'm not sure "usefulness" is the right barometer. That's why I'm looking forward to this case, and I also totally agree the assessment is flexible.
But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the relevant market is defined pretty broadly, e.g.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
From https://fairuse.stanford.edu/overview/fair-use/four-factors/
Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?
This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.
IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.
By your logic, Firefox is redistributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't randomly generating copyrighted content; it only does so when explicitly prompted by a user.
Of course, if the input I give to ChatGPT is "here is a piece from an NYT article, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT returns the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on the NYT's distribution rights for this content, since they are retrieving it from their own storage.
If they returned a link you could click, and the content were retrieved from the NYT, along with any other changes such as advertising, even if it were inside an iframe, it would be an entirely different matter.
The situations aren’t remotely similar, and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work; in the other, Word is taking keyboard input from the user, and Word itself isn’t producing anything.
> GPT is just a tool.
I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.
Which brings us back to the beginning.
> the user who’s asking it to produce copyrighted content.
ChatGPT was trained on copyrighted content. The fact that it CAN reproduce that copyrighted content, and the fact that it was trained on it, are what the argument is about.