Millions? Damn, they can churn out some content. 13 million[0]!
[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter....
(That's a story about Jayson Blair, one of their reporters who just plain made up stories and sources for months before getting caught)
Edit: Sheesh, even their apology is paywalled. Wiki background: https://en.wikipedia.org/wiki/Jayson_Blair?wprov=sfla1
In most respected media companies there is a really-important-to-journalists-who-work-there firewall between these sorts of corporate battles and the reporting on them.
Certainly, but debating the spirit behind copyright or even "how to regulate AI" (a vast topic, to put it mildly) is only one possible route these lawsuits could take.
I suspect that ultimately business is going to come first (in the name of innovation, of course), the law second, and ethics last -- if Google can scan 129 million books [1] and store them without even a slap on the wrist [2], OpenAI and anyone of that size can most surely continue to do what they're doing. This lawsuit and others like it are just the drama of 'due process'.
[1] https://booksearch.blogspot.com/2010/08/books-of-world-stand... [2] https://www.reuters.com/article/idUSBRE9AD0TT/
In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.
[1] https://www.nytimes.com/2023/12/27/business/media/new-york-t...
I am not saying that the NY Times is a CIA asset, but given the crap they have printed in the past, like the whole WMDs-in-Iraq saga and the puff pieces on Elizabeth Holmes, they are far from a completely independent and propaganda-free paper. Henry Kissinger would call the paper and have his talking points printed the next day regarding Vietnam. [1]
There is a huge tension between access to government officials and the independence of papers.
Honestly, I get this feeling about these lawsuits about using content to train LLMs.
Think of it this way: in growing up and learning to read and getting an education, you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, cars, vehicles, furniture, etc., many of which might have design patents. We have such silliness as it being illegal to commercially distribute photos of the Eiffel Tower at night [2].
What's the difference between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?
Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech. I mean no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?
[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war
[2]: https://www.travelandleisure.com/photography/illegal-to-take...
About a fifth to a quarter of public-facing Web servers are Windows Server. Most famously, Stack Overflow[1].
Got a link for that? Best I can find is 5% of all websites: https://www.netcraft.com/blog/may-2023-web-server-survey/
https://law.justia.com/cases/federal/appellate-courts/ca2/13...
https://en.wikipedia.org/wiki/Sackler_family
The family has been largely successful at avoiding any personal liability in Purdue's litigation. Many people feel the settlements of the Purdue lawsuits were too lenient. One key perceived aspect of the final settlements was that there were too many victims of the opioid epidemic for the courts to handle and attempt to make whole.
https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
They state that "with minimal prompting", ChatGPT will "recite large portions" of some of their articles with only small changes.
I wonder why they don't sue the Wayback Machine first. You can get the whole article on the Wayback Machine. Not just portions, and not with small changes, but verbatim. And you don't need any special prompting. As soon as you are confronted with a paywall on the Times website, all you need to do is go to the Wayback Machine, paste the URL, and you can read it.
There is a massive amount of pirated content in China, but Hollywood is also making billions there at the same time; in fact, China surpassed North America as the #1 market for Hollywood years ago [1].
The NYT is obviously different from Disney, and may not be able to bend its knees far enough, but maybe there are similar ways out of this.
[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...
No piracy or even AI was required here. Google's defense was that its product couldn't reproduce the book in its entirety, which was proven and turned the case into a Fair Use question instead. Given that it was much harder to litigate on those grounds, Google tried coercing the authors into a settlement before the District Court eventually decided the case in Google's favor altogether.
OpenAI's lawyers are aware of the precedent on copyright law. They're going to argue their application is Fair Use, and they might get away with it.
They should have thought of that before they went ahead and trained on whatever they could get.
Image models are going to have similar problems, even if they win on copyright there's still CSAM in there: https://www.theregister.com/2023/12/20/csam_laion_dataset/
E.g. "Japan's App Store antitrust case"
https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...
Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].
But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.
I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.
Taking the above into consideration:
1) Useful AI does not necessarily require attribution
2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI
3) We need some better way to fund human creativity, especially as it gets more and more commoditized
4) This is going to happen with or without us. Cat's outta the bag.
I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.
If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.
But either way, this is way bigger than the economics of 19th-century newspapers...
[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...
Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...
Apple caught a lot of shit over the past 18 months for their lack of AI strategy, but I think two years from now they're going to look like geniuses.
In what sense are they claiming their generated contents as their own IP?
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
https://openai.com/policies/terms-of-use
> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.
They asked her about the issue of copyrighted training data. Her response was:
""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.
So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """
Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.
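For what it's worth, that question is partly measurable. Here's a minimal sketch (the article and output strings are hypothetical stand-ins, not examples from the filing) of quantifying "verbatim" reproduction as the longest contiguous run of text shared between an article and a model's output:

```python
# Minimal sketch: measure "verbatim copying" as the longest contiguous
# run of text shared between an article and a model's output.
# Both strings below are hypothetical stand-ins.
from difflib import SequenceMatcher

def longest_verbatim_run(article: str, output: str) -> str:
    """Return the longest contiguous substring present in both texts."""
    m = SequenceMatcher(None, article, output, autojunk=False)
    match = m.find_longest_match(0, len(article), 0, len(output))
    return article[match.a : match.a + match.size]

article = ("the committee voted 7 to 2 on tuesday to approve the measure, "
           "capping months of negotiations between the two agencies.")
output = ("reports said the committee voted 7 to 2 on tuesday to approve "
          "the measure, though the talks had dragged on for months.")

run = longest_verbatim_run(article, output)
print(f"{len(run.split())} shared words: {run!r}")
```

Long shared runs across many articles would support the complaint; short, fragmentary overlaps would support the "hard to pull out verbatim" view.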
2. The non-profit OpenAI, Inc. company is not to be confused with the for-profit OpenAI GP, LLC [0] that it controls. OpenAI was solely a non-profit from 2015 to 2019; the for-profit arm was created in 2019, prior to the launch of ChatGPT. Microsoft has a significant investment in the for-profit company, which is why they're included in this lawsuit.
https://dspace.mit.edu/handle/1721.1/153216
As it should be.
Search for "four factors of fair use", e.g. https://fairuse.stanford.edu/overview/fair-use/four-factors/, which courts use to decide if a derived work is fair use. I think OpenAI will get killed in that fourth factor, "the effect of the use upon the potential market", which is what this case is really about. If the use substantially negatively affects the market for the original work, which I think it's easy to argue that it does, that is a huge factor against awarding a fair use exemption to OpenAI.
Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.
"We also collect the content you create, upload, or receive from others when using our services. This includes things like email you write and receive, photos and videos you save, docs and spreadsheets you create, and comments you make on YouTube videos."
Furthermore, if we manage to "untrain" AI on certain pieces of content, then copyright would really become "brain" damage too. Like, the perceptrons and stuff.
More critically, while fair use decisions are famously a judgement call, I think OpenAI will lose this based on the "effect of the fair use on the potential market" of the original content test. From https://fairuse.stanford.edu/overview/fair-use/four-factors/ :
> Another important fair use factor is whether your use deprives the copyright owner of income or undermines a new or potential market for the copyrighted work. Depriving a copyright owner of income is very likely to trigger a lawsuit. This is true even if you are not competing directly with the original work.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
and especially
> “The economic effect of a parody with which we are concerned is not its potential to destroy or diminish the market for the original—any bad review can have that effect—but whether it fulfills the demand for the original.” (Fisher v. Dees, 794 F.2d 432 (9th Cir. 1986).)
The "whether it fulfills the demand of the original" is clearly where NYTimes has the best argument.
I definitely agree with that (at least the "far in uncharted territory" bit; as far as "speculation being useless" goes, we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure "usefulness" is the right barometer), which is why I'm looking forward to this case, and I also totally agree the assessment is flexible.
But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the relevant market is defined pretty broadly, e.g.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
From https://fairuse.stanford.edu/overview/fair-use/four-factors/
Unclear what that corpus might be, or if it's the same books2 you are referring to.
Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.
IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.
"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...
https://docs.github.com/en/copilot/configuring-github-copilo...
Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.
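As a rough illustration of why it's so cheap (a toy sketch of the general technique, not GitHub's actual filter, and the 8-token window is my assumption): hash fixed-size token windows of the indexed code once, and each suggestion then costs only a handful of set lookups.

```python
# Toy sketch of exact-match filtering with hashed token n-grams.
# NOT GitHub's actual implementation; NGRAM is an assumed window size.
import hashlib

NGRAM = 8  # assumed window size, in whitespace-separated tokens

def ngram_hashes(text: str, n: int = NGRAM):
    toks = text.split()
    for i in range(len(toks) - n + 1):
        window = " ".join(toks[i : i + n])
        yield hashlib.sha1(window.encode()).hexdigest()

# Build the index once over the corpus (a hypothetical snippet here).
corpus = ["def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"]
index = {h for doc in corpus for h in ngram_hashes(doc)}

def matches_indexed_code(suggestion: str) -> bool:
    """Per-suggestion cost: a few hashes and set lookups."""
    return any(h in index for h in ngram_hashes(suggestion))

print(matches_indexed_code("return n if n < 2 else fib(n - 1) + fib(n - 2)"))  # True
```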
I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.
https://copyrightalliance.org/copyright-cases-2022/
Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.
Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can get you sued, whether it's an LLM or a sweatshop in India.
Wikipedia has some words on how summaries relate to copyright law: https://en.wikipedia.org/wiki/Wikipedia:Plot-only_descriptio...
In this case it's the NYT vs OpenAI, last decade it was the RIAA vs Napster.
I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...
Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.
> "[...] the fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work."
----
So here we have OpenAI, ostensibly a nonprofit, using portions of a copyrighted work for commenting on and educating (the prompting user), in a way that doesn't directly compete with NYT (nobody goes "Hey ChatGPT, what's today's news?"), not intentionally copying and publishing their materials (they have to specifically probe it to get it to spit out the copyrighted content). There's not a commercial intent to compete with the NYT's market. There is a subscription fee, but there is also tuition in private classrooms and that doesn't automatically make it a copyright violation. And citing the source or not doesn't really factor into copyright, that's just a politeness thing.
I'm not a lawyer. It's just not that straightforward. But of course the court will decide, not us randos on the internet...
I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.
And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.
I don't think we as a society owe it to any profession to artificially keep it alive as it has historically been. We do owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.
In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...
If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.
AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.
https://libraries.emory.edu/research/copyright/copyright-dat...
But even if it did, an exact-match search is not enough here. What if you take the source code and rename all the variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does the renaming).
For such a filter to be effective, it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.
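As a toy illustration of that idea (a sketch only, nowhere near a production clone detector), plain alpha-renaming over Python's own AST already catches pure identifier renames:

```python
# Sketch of the canonical-AST idea: map every identifier to a positional
# placeholder, then compare AST dumps. This only catches pure renames;
# real clone detection has to handle far more transformations.
import ast

class Canonicalize(ast.NodeTransformer):
    def __init__(self):
        self.names = {}  # original name -> placeholder

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

def fingerprint(src):
    return ast.dump(Canonicalize().visit(ast.parse(src)))

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
b = "def summe(ys):\n    acc = 0\n    for y in ys:\n        acc += y\n    return acc"
print(fingerprint(a) == fingerprint(b))  # True: identical modulo renaming
```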
Wouldn't it be better to either:
* Not include copyrighted content in the training material in the first place, or
* Explicitly tag the training material with license and origin information, such that the final output can produce a proof of what training material was relevant for producing that output, and not mix differently licensed content?
If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?
Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones -- all the goddamned wikis, databases, etc. based on his work, of which there are many, and most of which quote sections or whole passages of the books.
Someone prove to me that GPT can reproduce enough text verbatim to make it clear it was trained on the original text first hand, rather than second hand from other sources.
> OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.
This is the smoking gun. GPT-4 is a large model and hence highly prone to memorizing and reproducing content. They have many such examples in the court filing: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
IANAL, but that's a slam-dunk copyright violation.
NYT will likely win.
It's also why OpenAI should not go YOLO scaling up to GPT-5, which will likely recite even more copyrighted content. More parameters, more memorization.