zlacker

The New York Times is suing OpenAI and Microsoft for copyright infringement

submitted by ssgodd+(OP) on 2023-12-27 13:58:21 | 593 points 854 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only. [show all posts]
1. pm90+B2[view] [source] 2023-12-27 14:18:28
>>ssgodd+(OP)
NYT article with a lot more context https://www.nytimes.com/2023/12/27/business/media/new-york-t...
2. batch1+D2[view] [source] 2023-12-27 14:18:34
>>ssgodd+(OP)
> The New York Times is suing OpenAI and Microsoft over claims the companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with the outlet’s content.

Millions? Damn, they can churn out some content. 13 million[0]!

[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter....

◧◩◪
54. solard+E6[view] [source] [discussion] 2023-12-27 14:41:28
>>Abraha+p5
The NYT also hallucinates from time to time: https://www.nytimes.com/2003/05/11/us/correcting-the-record-...

(That's a story about Jayson Blair, one of their reporters who just plain made up stories and sources for months before getting caught)

Edit: Sheesh, even their apology is paywalled. Wiki background: https://en.wikipedia.org/wiki/Jayson_Blair?wprov=sfla1

◧◩◪◨⬒
123. muglug+h9[view] [source] [discussion] 2023-12-27 14:55:25
>>Abraha+d7
The NY Times company has an analogue to the Apple media relations page: https://investors.nytco.com/news-and-events/press-releases/

In most respected media companies there is a really-important-to-journalists-who-work-there firewall between these sorts of corporate battles and the reporting on them.

◧◩
136. achron+A9[view] [source] [discussion] 2023-12-27 14:57:13
>>blagie+z5
>the ship has sailed

Certainly, but debating the spirit behind copyright or even "how to regulate AI" (a vast topic, to put it mildly) is only one possible route these lawsuits could take.

I suspect that ultimately business will come first (in the name of innovation, of course), the law second, and ethics last -- if Google can scan 129 million books [1] and store them without even a slap on the wrist [2], OpenAI and anyone of that size can surely continue doing what they're doing. This lawsuit and others like it are just the drama of 'due process'.

[1] https://booksearch.blogspot.com/2010/08/books-of-world-stand... [2] https://www.reuters.com/article/idUSBRE9AD0TT/

137. breadw+E9[view] [source] 2023-12-27 14:57:24
>>ssgodd+(OP)
Here's the most important part (from NYT story on the lawsuit [1]):

In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.

[1] https://www.nytimes.com/2023/12/27/business/media/new-york-t...

141. sschue+U9[view] [source] 2023-12-27 14:58:38
>>ssgodd+(OP)
If I were the CIA or US government officials, I would want the NY Times to somehow drop this case: one would not want AIs that lack the talking points and propaganda pushed via the papers to become part of the record.

I am not saying that the NY Times is a CIA asset, but from the crap they have printed in the past, like the whole WMDs-in-Iraq saga and the puff piece on Elizabeth Holmes, they are far from a completely independent and propaganda-free paper. Henry Kissinger would call the paper and have his talking points regarding Vietnam printed the next day. [1]

There is a huge conflict between access to government officials and the independence of papers.

[1] https://youtu.be/kn8Ocz24V-0?si=kWyWXztWGjS_AJVl

142. jmyeet+X9[view] [source] 2023-12-27 14:59:11
>>ssgodd+(OP)
When the US entered WWI, it couldn't build planes despite having invented them. It had to buy planes from the French. Why? The Wright Brothers patent war [1]. This led to Congress creating a patent pool for aircraft that exists to this day.

Honestly, I get the same feeling about these lawsuits over using content to train LLMs.

Think of it this way: in growing up and learning to read and getting an education you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, cars, vehicles, furniture, etc, many of which might have design patents. We have such silliness as it being illegal to distribute photos commercially of the Eiffel Tower at night [2].

What's the difference between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?

Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech. I mean no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?

[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war

[2]: https://www.travelandleisure.com/photography/illegal-to-take...

◧◩
152. iandan+pa[view] [source] [discussion] 2023-12-27 15:01:50
>>wg0+79
You may be interested in https://unlearning-challenge.github.io/
◧◩
158. delta_+Da[view] [source] [discussion] 2023-12-27 15:03:13
>>wg0+79
> almost no workload (other than CAD, Graphics) runs on Windows or Unix including this very forum

About a fifth to a quarter of public-facing Web servers are Windows Server. Most famously, Stack Overflow[1].

[1]: https://meta.stackexchange.com/a/10370/1424704

◧◩◪
159. pastor+Ea[view] [source] [discussion] 2023-12-27 15:03:16
>>theonl+J6
https://www.tesla.com/blog/all-our-patent-are-belong-you
◧◩◪
196. candid+Ac[view] [source] [discussion] 2023-12-27 15:12:25
>>delta_+Da
> About a fifth to a quarter of public-facing Web servers are Windows Server

Got a link for that? Best I can find is 5% of all websites: https://www.netcraft.com/blog/may-2023-web-server-survey/

◧◩◪
201. iudqno+Tc[view] [source] [discussion] 2023-12-27 15:13:47
>>achron+A9
The court decided to focus on the tiny snippets Google displayed rather than the full text on their servers backing the search functionality. The court found it significant that Google deliberately limited the snippet view so it couldn't be used as a replacement for purchasing the original book. The opinion is a relatively easy read; I highly recommend it if you're interested in the issue. It's also notable that the court commented that the Google case was right on the edge of fair use.

https://law.justia.com/cases/federal/appellate-courts/ca2/13...

◧◩◪◨
214. solarp+td[view] [source] [discussion] 2023-12-27 15:17:19
>>mbruml+v8
Warner Brothers sued and won against Asylum for this very thing lmao. https://en.m.wikipedia.org/wiki/Mockbuster
◧◩◪
216. LordKe+wd[view] [source] [discussion] 2023-12-27 15:17:27
>>ubutle+ja
The Sackler family owned Purdue Pharma, which created OxyContin and heavily marketed the drug. Many Americans see the family as partially responsible for kickstarting the opioid epidemic.

https://en.wikipedia.org/wiki/Sackler_family

The family has been largely successful at avoiding any personal liability in Purdue's litigation. Many people feel the settlements of the Purdue lawsuits were too lenient. One key perceived aspect of the final settlements was that there were too many victims of the opioid epidemic for the courts to handle and attempt to make whole.

243. andy99+ff[view] [source] 2023-12-27 15:28:01
>>ssgodd+(OP)
When it comes to foundation models I think there needs to be a distinction between potential and actual infringement. You can use a broadly trained foundation model to generate copyright infringing content, just like you can use your brain to do so. But the fact that a model can generate such content doesn't mean it infringes by its mere existence. https://www.marble.onl/posts/general_technology_doesnt_viola...
245. TekMol+Df[view] [source] 2023-12-27 15:30:44
>>ssgodd+(OP)
The actual complaint is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

They state that "with minimal prompting", ChatGPT will "recite large portions" of some of their articles with only small changes.

I wonder why they don't sue the Wayback Machine first. You can get the whole article on the Wayback Machine: not just portions, and not with small changes but verbatim. And you don't need any special prompting. As soon as you are confronted with a paywall window on the Times website, all you need to do is go to the Wayback Machine, paste the URL, and read the article.

◧◩
263. jimmyd+Hg[view] [source] [discussion] 2023-12-27 15:35:41
>>dissid+B6
Another way to look at it is to consider being stolen from as part of the business model.

There is a massive amount of pirated content in China, but Hollywood is also making billions there at the same time; in fact, China surpassed NA as the #1 market for Hollywood years ago [1].

The NYT is obviously different from Disney, and may not be able to bend its knees far enough, but maybe there are similar ways out of this.

[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...

◧◩◪
264. meowfa+Lg[view] [source] [discussion] 2023-12-27 15:35:47
>>phatfi+j8
It doesn't refuse. See this comment containing examples from the complaint: >>38782668
◧◩◪◨⬒⬓
271. frakt0+vh[view] [source] [discussion] 2023-12-27 15:39:16
>>bnralt+sg
Except that every stackoverflow post is explicitly creative commons: https://stackoverflow.com/help/licensing
◧◩◪
279. smolde+fi[view] [source] [discussion] 2023-12-27 15:43:41
>>ametra+Yc
George R. R. Martin authored A Game of Thrones, but lost in court against Google when Google Books reproduced parts of his text verbatim: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

No piracy or even AI was required here. Google's defense was that their product couldn't reproduce the book in its entirety, which was demonstrated and turned the case into a fair use question instead. Given that it was much harder to litigate on those grounds, Google tried coercing the authors into a settlement before the District Court eventually decided the case in Google's favor altogether.

OpenAI's lawyers are aware of the precedent on copyright law. They're going to argue their application is fair use, and they might get away with it.

309. ssully+6l[view] [source] 2023-12-27 16:01:41
>>ssgodd+(OP)
Haven’t seen anyone mention how Apple is exploring deals with news publishers, like the NYTimes, to train its LLMs[1].

[1]: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

311. Satees+bl[view] [source] 2023-12-27 16:02:28
>>ssgodd+(OP)
I posted this a few months ago - >>34381399

Piracy at scale

◧◩
323. bigbil+jm[view] [source] [discussion] 2023-12-27 16:09:14
>>wg0+79
> But how that can be possible for an LLM?

They should have thought of that before they went ahead and trained on whatever they could get.

Image models are going to have similar problems; even if they win on copyright, there's still CSAM in there: https://www.theregister.com/2023/12/20/csam_laion_dataset/

◧◩◪◨
334. aantix+6n[view] [source] [discussion] 2023-12-27 16:14:31
>>apante+6m
It's possible. Perplexity.ai is trying to solve this problem.

E.g. "Japan's App Store antitrust case"

https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...

◧◩◪◨⬒
343. belter+Nn[view] [source] [discussion] 2023-12-27 16:18:15
>>z7+Rm
Still is in many countries with excellent diplomatic relations with the Western World:

https://www.cfr.org/backgrounder/what-kafala-system

◧◩◪
348. solard+ro[view] [source] [discussion] 2023-12-27 16:22:08
>>aantix+1l
There's a few levels to this...

Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].

But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.

I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.

Taking the above into consideration:

1) Useful AI does not necessarily require attribution

2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI

3) We need some better way to fund human creativity, especially as it gets more and more commoditized

4) This is going to happen with or without us. Cat's outta the bag.

I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen; the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.

If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.

But either way, this is way bigger than the economics of 19th-century newspapers...

[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...

◧◩◪
360. JW_000+Hp[view] [source] [discussion] 2023-12-27 16:27:49
>>fallin+07
In the EU, countries can (and do) impose levies on printers and scanners because they may be used to copy copyrighted material (https://www.insideglobaltech.com/2013/07/12/eu-member-states...). Similar levies exist for blank CDs, USB sticks, MP3 players etc. In the US, this applies to "blank CDs and personal audio devices, media centers, satellite radio devices, and car audio systems that have recording capabilities." (See https://en.wikipedia.org/wiki/Private_copying_levy)
◧◩
376. 015a+tq[view] [source] [discussion] 2023-12-27 16:32:09
>>kbos87+Na
> a more established competitor

Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

Apple caught a lot of shit over the past 18 months for their lack of an AI strategy, but I think two years from now they're going to look like geniuses.

◧◩◪◨
382. aragon+or[view] [source] [discussion] 2023-12-27 16:36:51
>>JCM9+Bj
> They’re being sued for passing off substantial bits of NYTimes content as their own IP and then charging for it saying it’s their own IP.

In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

https://openai.com/policies/terms-of-use

> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.

◧◩◪
426. spunke+sv[view] [source] [discussion] 2023-12-27 16:57:52
>>theGnu+Ri
> It's likely fair use.

I agree. You can even listen to the NYT Hard Fork podcast (which I recommend, btw: https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...), where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.

They asked her about the issue of copyrighted training data. Her response was:

""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.

So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """

Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.

◧◩◪◨
435. cowsup+Aw[view] [source] [discussion] 2023-12-27 17:04:39
>>Baldbv+V5
1. Non-profit != "not making a profit." A non-profit can still earn monetary profit, and many do.

2. The non-profit OpenAI, Inc. is not to be confused with the for-profit OpenAI GP, LLC [0] that it controls. OpenAI was solely a non-profit from 2015 to 2019; the for-profit arm was created in 2019, prior to the launch of ChatGPT. Microsoft has a significant investment in the for-profit company, which is why Microsoft is included in this lawsuit.

[0] https://openai.com/our-structure

◧◩◪◨⬒
453. tansey+Hx[view] [source] [discussion] 2023-12-27 17:11:16
>>aantix+bp
Can you imagine spending decades of your life studying antibiotics, only to have an AI graph neural network beat you to the punch by conceiving an entirely new class of antibiotics (the first in 60 years) and then getting published in Nature?

https://www.nature.com/articles/d41586-023-03668-1

◧◩◪◨⬒⬓
466. aantix+Ly[view] [source] [discussion] 2023-12-27 17:18:05
>>tansey+Hx
It looks like the published paper managed to include plenty of citations.

https://dspace.mit.edu/handle/1721.1/153216

As it should be.

◧◩◪◨⬒⬓
489. aantix+pB[view] [source] [discussion] 2023-12-27 17:31:42
>>jquery+dy
"Now displaying 3 citations out of ~150,000,000.."

[1] http://web.archive.org/web/20120608192927/http://www.google....

[2] https://steemit.com/online/@jaroli/how-google-search-result-...

[3] https://www.smashingmagazine.com/2009/09/search-results-desi...

[4] Next page

:)

◧◩◪◨⬒
495. Captai+8C[view] [source] [discussion] 2023-12-27 17:35:45
>>shkkmo+co
It does exist, and you'd be glad to know that it's going in the pro-AI/training direction: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...
◧◩
504. hn_thr+mD[view] [source] [discussion] 2023-12-27 17:43:33
>>Aurorn+84
While I think the verbatim text strengthens the NYTimes' argument, I think people are focusing on it too strongly, the idea being that if OpenAI could just "fix" that, then they'd be in the clear.

Search for "four factors of fair use" (e.g. https://fairuse.stanford.edu/overview/fair-use/four-factors/), which courts use to decide whether a derived work is fair use. I think OpenAI will get killed on the fourth factor, "the effect of the use upon the potential market", which is what this case is really about. If the use substantially and negatively affects the market for the original work, which I think is easy to argue it does, that is a huge factor against awarding a fair use exemption to OpenAI.

◧◩◪◨
505. Captai+LD[view] [source] [discussion] 2023-12-27 17:46:03
>>belter+Vi
The expectation to make money from artificially restricting an abundant resource. While copyright is a way to create funding, it also massively harms society by restricting future creators from being able to freely reuse previous works. Modern ways to deal with this are patronage, government funding, foundations (e.g. NLNet) and crowdfunding.

Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.

https://questioncopyright.org/minute-memes-credit-is-due

◧◩◪◨
522. jjtheb+NF[view] [source] [discussion] 2023-12-27 17:56:10
>>mbruml+s7
This passage from the Google Terms of Service (https://policies.google.com/privacy?hl=en-US) makes me wonder who owns what, since users of Gmail agree to it:

"We also collect the content you create, upload, or receive from others when using our services. This includes things like email you write and receive, photos and videos you save, docs and spreadsheets you create, and comments you make on YouTube videos."

◧◩◪
529. Captai+iG[view] [source] [discussion] 2023-12-27 17:58:48
>>hacker+7p
Copyright Is Brain Damage by Nina Paley [1] claimed that culture is like a bunch of neurons passing and evolving data to each other, and copyright is like severing the ties between the neurons, like brain damage. It also presented [2] an alternative way of viewing art and science, as products of the common culture, not a product purely from the creator, to be privatised. This sounds really relevant to your comment.

Furthermore, if we manage to "untrain" AI on certain pieces of content, then copyright would really become "brain" damage too. Like, the perceptrons and stuff.

[1] https://www.youtube.com/watch?v=XO9FKQAxWZc

[2] No, I'm not an AI, just autistic.

◧◩◪◨
541. hn_thr+JH[view] [source] [discussion] 2023-12-27 18:06:20
>>whichf+jF
I'm not saying copyright is without problems (e.g. there is no reason I think its protection should be as long as it is), but I think the opposite, where the incentive to create new content (especially in the case of news reporting) is completely killed because someone else gets to vacuum up all the profits, is worse. I mean, existing copyright does protect tons of independent writers, artists, etc. and prevents all of the profits from their output from being "sucked up" by a few entities.

More critically, while fair use decisions are famously a judgement call, I think OpenAI will lose this based on the "effect of the fair use on the potential market" of the original content test. From https://fairuse.stanford.edu/overview/fair-use/four-factors/ :

> Another important fair use factor is whether your use deprives the copyright owner of income or undermines a new or potential market for the copyrighted work. Depriving a copyright owner of income is very likely to trigger a lawsuit. This is true even if you are not competing directly with the original work.

> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

and especially

> “The economic effect of a parody with which we are concerned is not its potential to destroy or diminish the market for the original—any bad review can have that effect—but whether it fulfills the demand for the original.” (Fisher v. Dees, 794 F.2d 432 (9th Cir. 1986).)

The "whether it fulfills the demand of the original" is clearly where NYTimes has the best argument.

◧◩◪◨⬒
549. hn_thr+SI[view] [source] [discussion] 2023-12-27 18:12:45
>>throwu+eF
> we're so far in uncharted territory any speculation is useless

I definitely agree with that (at least the "far in uncharted territory" bit; as for "speculation being useless", we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure usefulness is the right barometer), which is why I'm looking forward to this case, and I also totally agree the assessment is flexible.

But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the relevant market is defined pretty broadly, e.g.

> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)

From https://fairuse.stanford.edu/overview/fair-use/four-factors/

◧◩◪◨
569. belter+HK[view] [source] [discussion] 2023-12-27 18:22:03
>>devind+0C
Really? Because the GPT-3 paper talks about "...two internet-based books corpora (Books1 and Books2)..." (see pages 8 and 9) - https://arxiv.org/pdf/2005.14165.pdf

Unclear what those corpora might be, or if it's the same books2 you are referring to.

◧◩
574. tracyh+zL[view] [source] [discussion] 2023-12-27 18:26:31
>>kbos87+Na
> the first being at the birth of modern search engines.

Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...

◧◩
581. benrow+KM[view] [source] [discussion] 2023-12-27 18:33:34
>>wg0+79
Google have their "Machine Unlearning" challenge to address this specific issue - removing the influence of given training data without retraining from scratch. Seems like a hard problem. https://blog.research.google/2023/06/announcing-first-machin...
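For intuition, here is a minimal, purely illustrative sketch of the naive baseline approach (gradient ascent on the data to forget, ordinary descent on the data to retain). PyTorch is assumed, and this is nobody's production method; the challenge exists precisely because simple approaches like this tend to hurt overall model quality without provably removing the influence:

    import torch

    def naive_unlearn(model, forget_loader, retain_loader, lr=1e-4, epochs=1):
        # Naive unlearning baseline: push the loss UP on the forget set
        # while holding it DOWN on the retain set.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for (xf, yf), (xr, yr) in zip(forget_loader, retain_loader):
                opt.zero_grad()
                loss = -loss_fn(model(xf), yf) + loss_fn(model(xr), yr)
                loss.backward()
                opt.step()
        return model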
◧◩◪◨⬒
587. lelant+JN[view] [source] [discussion] 2023-12-27 18:39:29
>>aragon+or
> In what sense are they claiming their generated contents as their own IP?

https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...

>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."

How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.

IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.

◧◩◪
606. belter+VQ[view] [source] [discussion] 2023-12-27 18:56:06
>>tracyh+zL
Just because they asked for forgiveness instead of asking first for permission, its original sins will not be erased :-)

"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...

◧◩◪◨⬒⬓⬔
608. TheCor+7R[view] [source] [discussion] 2023-12-27 18:57:04
>>tsimio+tM
GitHub Copilot supports that:

https://docs.github.com/en/copilot/configuring-github-copilo...

Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.

◧◩◪
615. happym+FR[view] [source] [discussion] 2023-12-27 18:59:42
>>logicc+DP
> while humans face no such restriction.

I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.

https://copyrightalliance.org/copyright-cases-2022/

Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.

Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can get you sued, whether it's an LLM or a sweatshop in India.

◧◩
713. pauldd+Vm1[view] [source] [discussion] 2023-12-27 21:50:43
>>mvcald+K3
There is no difference between an LLM summarizing a copyrighted work and a Wikipedia contributor summarizing a copyrighted work.

Wikipedia has some words on how summaries relate to copyright law: https://en.wikipedia.org/wiki/Wikipedia:Plot-only_descriptio...

◧◩◪
722. solard+iq1[view] [source] [discussion] 2023-12-27 22:09:28
>>ahepp+vI
I don't know that "absolute utilitarianism", if such a thing could even exist, would make a sound moral framework; that sounds too much like a "tyranny of the majority" situation. Tech companies shouldn't make the rules. And they shouldn't be allowed to just do whatever they want. However, this isn't that. This is just a debate over intellectual property and copyright law.

In this case it's the NYT vs OpenAI, last decade it was the RIAA vs Napster.

I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...

Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.

◧◩◪◨⬒
723. solard+rr1[view] [source] [discussion] 2023-12-27 22:18:02
>>spopej+ng1
I don't think fair use is quite that black-and-white. There are many factors: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors (from 17 USC 107: https://www.govinfo.gov/content/pkg/USCODE-2010-title17/html...)

> "[...] the fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work."

----

So here we have OpenAI, ostensibly a nonprofit, using portions of a copyrighted work for commenting on and educating (the prompting user), in a way that doesn't directly compete with the NYT (nobody goes "Hey ChatGPT, what's today's news?"), and not intentionally copying and publishing their materials (you have to specifically probe it to get it to spit out the copyrighted content). There's no commercial intent to compete with the NYT's market. There is a subscription fee, but there is also tuition in private classrooms, and that doesn't automatically make it a copyright violation. And citing the source or not doesn't really factor into copyright; that's just a politeness thing.

I'm not a lawyer. It's just not that straightforward. But of course the court will decide, not us randos on the internet...

◧◩◪
731. solard+Du1[view] [source] [discussion] 2023-12-27 22:36:08
>>DeIlli+0O
Is it? My job as a frontend dev is similarly threatened by OpenAI, maybe even more so than journalists'. The very company I usually like to pay to help with my work (Vercel) is in the process of using that same money to replace me with AI as we speak, lol (https://vercel.com/blog/announcing-v0-generative-ui). I'm not complaining. I think it's great progress, even if it'll make me obsolete soon.

I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.

And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.

I don't think we as a society owe it to any profession to artificially keep it alive as it's historically been. We do owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.

In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...

If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.

AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.

◧◩◪◨⬒
732. hn_thr+Wu1[view] [source] [discussion] 2023-12-27 22:38:04
>>graphe+mh1
Per my other comment here, >>38784723, courts have previously ruled that whether people would cancel their NYT subscription is irrelevant to that test.
741. teloto+gy1[view] [source] 2023-12-27 22:58:01
>>ssgodd+(OP)
Complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
◧◩
772. bitsag+lX1[view] [source] [discussion] 2023-12-28 02:53:41
>>pelora+d6
Does allowing a model to train on copyrighted material implicitly mean the associated output would also be legal? They plan to expand upon this decision, but I'm curious in the meantime [1]. I'd assume this NYT problem would still exist in Japan.

[1] https://asia.nikkei.com/Business/Technology/Japan-panel-push...

◧◩◪◨⬒⬓⬔⧯
774. anigbr+EY1[view] [source] [discussion] 2023-12-28 03:07:35
>>tremon+801
No, it is not. You can make a better argument than just BSing.

https://libraries.emory.edu/research/copyright/copyright-dat...

◧◩◪◨⬒
792. Captai+0q2[view] [source] [discussion] 2023-12-28 08:22:50
>>blagie+ol1
Thanks, but it's not my poem! You can find it here: https://blog.ninapaley.com/2009/12/15/minute-meme-1-copying-...
◧◩◪◨⬒⬓⬔⧯
794. edwint+Ty2[view] [source] [discussion] 2023-12-28 09:50:46
>>TheCor+7R
It is questionable whether that filtering mechanism works; previous discussion: >>33226515

But even if it did, an exact-match search is not enough here. What if you take the source code and rename all the variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does it).

For such a filter to be effective, it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.

Wouldn't it be better to either:

* not include copyrighted content in the training material in the first place, or
* explicitly tag the training material with license and origin information, so that the final output can produce a proof of which training material was relevant to producing that output, and differently licensed content doesn't get mixed?
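To make the renaming point concrete, here is a minimal illustrative sketch using Python's ast module (emphatically not GitHub's actual filter) of how renaming defeats exact-match search while a canonicalized AST still matches:

    import ast

    class Canonicalizer(ast.NodeTransformer):
        # Rename every variable, function, and argument to a positional
        # placeholder, so renamed copies collapse to the same form.
        def __init__(self):
            self.names = {}

        def _canon(self, name):
            return self.names.setdefault(name, f"_v{len(self.names)}")

        def visit_Name(self, node):
            node.id = self._canon(node.id)
            return node

        def visit_FunctionDef(self, node):
            node.name = self._canon(node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = self._canon(node.arg)
            return node

    def fingerprint(source):
        return ast.dump(Canonicalizer().visit(ast.parse(source)))

    original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
    renamed = "def sum_all(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc"

    print(original == renamed)                            # False: exact match misses the copy
    print(fingerprint(original) == fingerprint(renamed))  # True: identical canonical ASTs

Even this only catches trivial rewrites; reordered statements or restructured control flow would still slip through, which is exactly why doing it reliably at scale is hard.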

795. edwint+8z2[view] [source] 2023-12-28 09:53:16
>>ssgodd+(OP)
They are not the only ones to sue. There is also a class action: https://githubcopilotlitigation.com/
◧◩
825. dang+r44[view] [source] [discussion] 2023-12-28 19:37:39
>>superd+Ic1
We've banned this account for posting unsubstantive and/or flamebait comments and using HN primarily for ideological battle. That's not what this site is for.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.

◧◩◪◨⬒⬓⬔
835. fennec+JY4[view] [source] [discussion] 2023-12-29 02:10:14
>>ahepp+iN4
Sure, it definitely spits out facts, often without hallucinating. And it can reiterate titles and small chunks of copyrighted text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got those chunks of "copyrighted" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones, all the goddamned wikis, databases, etc. based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim to make it clear that it was trained on the original text first-hand, rather than second-hand from other sources.
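The experiment being asked for is at least easy to state: measure the model output's longest verbatim overlap against the original book versus against the wikis. A minimal sketch using Python's difflib (the three strings are placeholder stand-ins, not real data):

    from difflib import SequenceMatcher

    def longest_verbatim_overlap(generated, source):
        # Longest contiguous character run shared by the two texts.
        m = SequenceMatcher(None, generated, source, autojunk=False)
        match = m.find_longest_match(0, len(generated), 0, len(source))
        return generated[match.a:match.a + match.size]

    gpt_output = "placeholder: text sampled from the model"
    book_text = "placeholder: the original novel"
    wiki_text = "placeholder: fan wiki articles quoting the novel"

    # If the overlap with the book is long where the overlap with the
    # wikis is short, first-hand training looks more plausible.
    print(len(longest_verbatim_overlap(gpt_output, book_text)))
    print(len(longest_verbatim_overlap(gpt_output, wiki_text)))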

851. nojvek+XWd[view] [source] 2024-01-01 21:38:10
>>ssgodd+(OP)
> For example, in 2019, The Times published a Pulitzer-prize winning, five-part series on predatory lending in New York City’s taxi industry. The 18-month investigation included 600 interviews, more than 100 records requests, large-scale data analysis, and the review of thousands of pages of internal bank records and other documents, and ultimately led to criminal probes and the enactment of new laws to prevent future abuse.

> OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.

This is the smoking gun. GPT-4 is a large model and hence highly likely to memorize and reproduce content. They have many such examples in the court filing: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

IANAL, but that's a slam-dunk copyright violation.

NYT will likely win.

It's also why OpenAI should not go YOLO scaling up to GPT-5, which would likely recite even more copyrighted content. More parameters, more memorization.

[go to top]