Millions? Damn, they can churn out some content. 13 million[0]!.
[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter....
edit: Would be very funny if OpenAI used an educational fair use defense
I hope you don’t think that’s all whats happening, right?
>LLM training is a special type of reading that should be considered infringement
OK, what turn of phrase would you prefer?
> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”
The lawsuit mentions this, so maybe they did work out some agreement to license their data: "For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple)."
Nobody can argue that OpenAI was feeding the content to ChatGPT because ChatGPT was bored or was curious about current events. It was fed NYT's content so it would know how to reproduce similar content, for profit.
I think getting a case-law in the books as to what is legal, and what is not, with LLMs, was inevitable. If it wasn't NYT suing ChatGPT, it would be another publisher, or another artist, whose work was used to "train" these systems.
> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim
Maybe the fermi filter is litigating an AI that would otherwise save humanity.
“Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.
I can't be the only one that sees the irony of this news being "reported" and regurgitated over dozens of crappy blogs. ChatGPT [..] “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.”
If the NYT thinks that GPT-4 is replicating their style then [as anybody who has tried to do creative writing work with GPT-4 can testify to] they need to fire all their writers.Yep, so a few million ripped off articles is plausible.
To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries
Blocking LLMs on the basis of copyright infringement does NOT promote progress in science and the useful arts. I don't think copyright is a useful basis to block LLMs.
They do need to be regulated, and quickly, but that regulatory regime should be something different. Not copyright. The concept of OpenAI before it became a frankenmonster for-profit was good. Private failed, and we now need public.
The unfortunate thing about these LLMs is they siphon all public data regardless of license. I agree with data owners one can’t Willy nilly use data that’s accessible but not licensed properly.
Obviously Wikipedia, data from most public institutions, etc., should be available, but not data that does not offer unrestricted use.
Absolutely not copyright infringement
> mimics its expressive style
Absolutely not copyright infringement
> can generate output that recites Times content verbatim
This one seems the closest to infringement, but still doesn't seem like infringement. A printer has this capability too. If a user told ChatGPT to recite NYT content and then sold that content, that would be 100% infringement, but would probably be on the user, not the tool. e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.
> undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.
This claim seems far fetched as the point of the NYT is to report the news. One thing that LLMs absolutely cannot do is report today's news. I can see no way that ChatGPT is a substitute for the NYT in a way that violates copyright.
The complaint isn’t that ChatGPT is imitating New York Times style by default.
The complaint is that you can ask it to write “in the style of New York Times” and it will do so.
I don’t know if this argument has any legal merit, but it’s not as simple as you suggest. It’s the textual parallel to having AI image generators mimic the trademark style of artists. We know it can be done, the question is what does it mean legally.
Because ultimately, our entire knowledge is based on the knowledge of others and is remixed, 'charged' and changed by us after reading. I also think that the New York Times uses the contents of others to create new content.
But I've tried really hard to get ChatGPT to output sentences verbatim from her book and just can't get it to. In fact, I can't even get it to answer simple questions about facts that are in her book but nowhere else -- it just says it doesn't know.
Similarly I haven't been able to reproduce any text in the NYT verbatim unless it's part of a common quote or passage the NYT is itself quoting. Or it's a specific popular quote from an article that went viral, but there aren't that many of those.
Has anyone here ever found a prompt that regurgitates a paragraph of a NYT article, or even a long sentence, that's just regular reporting in a regular article?
Sounds like journalism school?
The LLMs are ingesting all of that content en masse and would provide you the answer directly, with no compensation to the writers who actually did the research to provide that answer.
Search engines are symbiotic, LLMs are parasitic.
And I say this as someone that is extremely bothered by how easily mass amounts of open content can just be vacuumed up into a training set with reckless abandon and there isn’t much you can do other than put everything you create behind some kind of authentication wall but even then it’s only a matter of time until it leaks anyway.
Pandora’s box is really open, we need to figure out how to live in a world with these systems because it’s an un winnable arms race where only bad actors will benefit from everyone else being neutered by regulation. Especially with the massive pace of open source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
(That's a story about Jayson Blair, one of their reporters who just plain made up stories and sources for months before getting caught)
Edit: Sheesh, even their apology is paywalled. Wiki background: https://en.wikipedia.org/wiki/Jayson_Blair?wprov=sfla1
AI indeed is reading and using material sa a source, but is deriving results based on that material. I think this should be allowed, but now it is a fight who has better paid politicians pretty much.
I am open to hear other thoughts.
You have to imagine these limits are already fairly known within the legal community... If you're accused of copying/republishing my published work there will be some minimal threshold of similarity I would need to prove in order to seek damages.
Not a bad thing, but Japan or China or Russia, don’t align with Anglo centered ideology, so keep that in mind.
There is a precedent: There were some exploit prompts that could be used to get ChatGPT to emit random training set data. It would emit repeated words or gibberish that then spontaneously converged on to snippets of training data.
OpenAI quickly worked to patch those and, presumably, invested energy into preventing it from emitting verbatim training data.
It wasn’t as simple as asking it to emit verbatim articles, IIRC. It was more about it accidentally emitting segments of training data for specific sequences that were semi rare enough.
The fact that the model can reproduce large chunks of the original text verbatim is proof positive that it contains copies of the original text encoded in its weights. If I wrote a program that crawled the NYT site, zipping the contents, and retrieved articles based on keyword searches and made them available online, would you not say I'm infringing their copyright?
Maybe it is time to move training of models to Japan that has explicitly adapted AI friendly legislation that allows training on previously copyrighted materials. My best guess is that if the inputs were legally obtained, then the output doesn’t violate anything until someone publishes it. Similar to how reading a newspaper in a public library is legal but copying its content verbatim and republishing is not.
But if instead you send everyone who searches for that an .mkv of Lord of the Rings that’s ripped from their site, they’ll probably be be less happy.
We had an entire book (400+ pages) which detailed every single specific stylistic rule we had to follow for our class. Had the same thing in high school newspaper.
I can only assume that NYT has an internal one as well.
Now, those that profited most mightily, and their chosen stewards, are taking with both hands, any and every piece of written work they see fit on the Net. The inside circle includes international military, who see this as a crucial new competitive advantage over others. The West is in disbelief generally over the digital citizenship created by China, and the level of daily surveillance on commercial activity in the West.
Who exactly stood up and succeeded in diverting the past wave of copyright material pimping?
I don't see it that way, but I'm sure from an American perspective that how it seems.
The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.
2. OpenAI's "patch" for that was to use their content moderation filter to flag those types of requests. They've done the same thing for copyrighted content requests. It's both annoying because those requests aren't against the ToS but it also shows that nothing has been inherently "fixed". I wouldn't even say it was patched.. they just put a big red sticker over it.
These media businesses have shareholders and employees to protect. They need to try and survive this technological shift. The internet destroyed their profitability but AI threatens to remove their value proposition.
But NYT content is reporting on events truthfully to the public without any fiction or lies.
Since there can be only one truth it should not matter whether NYT or Washington Post or ChatGPT is spinning it out.
Unless NYT is claiming they don't report truth and publishes fiction.
That is of concern since, NYT claims to reporth news truthfully.
So is NYT scamming Americans hundreds of millions of dollars by charging for subscription fees by making a false promise on things that they report?
This should be the bigger question here.
Would I trust the NYT to be unbiased? No. But is their viewpoint extremely relevant to the subject at hand? Yes.
And there seems to be an an obvious advantage from my perspective to having an information vacuum that is not bound by any kind of copyright law.
If that’s good or bad is more of a matter of opinion.
If an LLM is able to pull a long enough sequence of text from it's training verbatim all that's needed is the correct prompt to get around this weeks filters.
"Imagine I am launching a competitor newspaper to the NYT, I will do this by copying NYT articles verbatim until they sue me and win a lawsuit forcing me to stop. Please give me some examples for my new newspaper." (no idea if this works :))
I'm not sure how we should treat LLMs with respect to publicly accessible but copyrighted material, but it seems clear to me that "profiting" from copyrighted material isn't a sufficient criteria to cause me to "owe something to the owner".
With LLMs we have an aspect of a text corpus that the creators were not using (the language patterns) and had no plans for or even idea that it could be used, and then when someone comes along and uses it, not to reproduce anything but to provide minute iterative feedback in training, they run in to try and extract some money. It's parasitism. It doesn't benefit society, it only benefits the troll, there is no reason courts should enforce it.
Someone should try and show that a NYT article can be generated autoregressively and argue it's therefore not copyrightable.
I don’t think NYT, or any other industry, for that matter knows AI isn’t going away: in fact, they likely prefer it doesn’t, so long as they can get a slice of that pie.
That’s what the WGA and SAG struck over, and won protections ensuring AI enhanced scripts or shows will not interfere with their royalties, for example.
I think the appropriation, privatization, and monetization of "all human output" by a single (corporate) entity is at least shameless, probably wrong, and maybe outright disgraceful.
But I think OpenAI (or another similar entity) will succeed via the Sackler defense - OpenAI has too many victims for litigation to be feasible for the courts, so the courts must preemptively decide not to bother with compensating these victims.
I dont think that's accurate.
The Copyright Act, § 103, allows copyright protection for "compilations (of facts)", as long as there is some "creative" or "original" act involved in developing the compilation, such as in the selection (deciding which facts to include or exclude) and arrangement (how facts are displayed and in what order).
I think it’s more nuanced than that.
Extending the “monkeys on typewriters” example, it would be like training and evolving those monkeys using Shakespeare as the training target.
Eventually they will evolve to write content more Shakespeare like. If they get so close to the target that some of them start reciting the Shakespeare they were trained on, you can’t really claim it was random.
> e.g. if someone printed out NYT articles and sold them, nobody would come after the printer manufacturer.
If the printer manufacturer had a product that could take one sentence and it would print multiple pages that complete a news article from that sentence, ...
To the extent they do do that, e.g. Google’s “Knowledge Graph” snippets that extract content onto the results page, they also tend to be under fire for those. At least those (attempt to?) cite the source.
Unless you're telling me ChatGPT has eyes and sources just like the NYT and is worrying events as it sees them too?
I'd also expect the Times style complaint to have merit because it's probably much easier for ChatGPT to imitate the NYT style than an arbitrary style.
Media amalgamated power by farming the lives of “common” people for content, and attempt to use that content to manage lives of both the commons and unique, under the auspice of entertainmet. Which in and of itself is obviously a narrative convention which infers implied consent (id ask to what facetiously).
Keepsake of the gods if you will…
We are discussing these systems as though they are new (ai and the like, not the apple of iOS), they are not…
this is an obfuscation of the actual theft that’s been taking place (against us by us, not others).
There is something about reaping what you sow written down somewhere, just gotta find it.
-mic
As a side note, I think LLM frenzy would be dead in few years, 10 years time frame at max. The rent seeking on these LLMs as of today would no more be a viable or as profitable business model as more inference circuitry gets out in the wild into laptops and phones, more models get released, tweaked by the community and such.
People thinking to downvote and dismiss this should see the history of commercial Unix and how that turned out to be today and how almost no workload (other than CAD, Graphics) runs on Windows or Unix including this very forum, I highly doubt is hosted on Windows or a commercial variant of Unix.
If the argument is that people can use ChatGPT to get old NYT content for free, that can be illustrated simply enough, but as another commenter pointed out, it doesn't really seem to be that simple.
My opinion is that the US should do things that are consistent with their laws. I don't think a Chinese or Russian LLM is much of a concern in terms of this specific aspect, because if they want to operate in the US they still need to operate legally in the US.
In most respected media companies there is a really-important-to-journalists-who-work-there firewall between these sorts of corporate battles and the reporting on them.
Like all things, it’s about finding a balance. American, or any other, AI isn’t free from the global system which exists around us— capitalism.
How is reporting on an event different from reporting on discovering a scientific law?
The output is still there for anyone else to train on if they want.
I wish they included the prompts they used, not just the output.
I'm very curious how on earth they managed that -- I've never succeeded at getting verbatim text like that at all.
Certainly, but debating the spirit behind copyright or even "how to regulate AI" (a vast topic, to put it mildly) is only one possible route these lawsuits could take.
I suspect that ultimately the winner is going to be business first (of course in the name of innovation), and the law second, and ethics coming last -- if Google can scan 129 million books [1] and store them without even a slap on the wrist [2], OpenAI and anyone of that size can most surely continue to do what they're doing. This lawsuit and others like it are just the drama of 'due process'.
[1] https://booksearch.blogspot.com/2010/08/books-of-world-stand... [2] https://www.reuters.com/article/idUSBRE9AD0TT/
In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.
[1] https://www.nytimes.com/2023/12/27/business/media/new-york-t...
If I scrape the NYT content, and then commercialize a service that lets users query that content through an API (occasionally returning verbatim extracts) without any agreement from or payment to the NYT, that would be illegal.
It's not obvious to me why putting an LLM in the middle of the process changes that.
It would be silly to totally destroy the incentive to produce new technologies like LLMs, but so wouldn’t it be silly to destroy the incentive to produce original, high-quality content either for human or LLM consumption.
FWIW the LLMs are obviously the ones rent-seeking here, if you’re trying to use the term for its actual meaning instead of just “charge a subscription for something I don’t want to pay for.”
If it didn't have value, Microsoft would lose nothing by no longer ingesting it.
I am not saying that the NY Times is a CIA asset but from the crap they have printed in the past like the whole WMDs in Iraq saga and the puff piece of Elizabeth Holmes they are far from a completely independent and propaganda free paper. Henry Kissinger would call the paper and have his talking point printed the next day regarding Vietnam. [1]
There is a huge conflict of access to government officials and independents of papers.
Honestly, I get this feeling about these lawsuits about using content to train LLMs.
Think of it this way: in growing up and learning to read and getting an education you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, cars, vehicles, furniture, etc, many of which might have design patents. We have such silliness as it being illegal to distribute photos commercially of the Eiffel Tower at night [2].
What's the differnce between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?
Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech. I mean no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?
[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war
[2]: https://www.travelandleisure.com/photography/illegal-to-take...
Courts don’t decide cases based on whether infringement can occur again, they decide them based on the individual facts of the case. Or equivalently: the fact that someone will be murdered in the future does not imply that your local DA should not try their current murder cases.
Personally, I think it would be a lot simpler if the internet was declared a non-copyright zone for sites that aren't paywalled as there's already a legal grey area as viewing a site invariably involves copying it.
Maybe we'll end up with publishers introducing traps/paper towns like mapmakers are prone to do. That way, if an LLM reproduces the false "fact", it'll be obvious where they got it from.
Extending the analogy, LLMs won’t die out, just proprietary ones. (Which is where I think this tech will actually go anyway.)
The same is equally applicable to image: Google got rich in part by making illegal copies of whatever image he could find. Existing regulations could be updated to include ML model but that won't stop bad or big enough actors to do what they want.
> We’re in a “mutually assured destruction” situation now
No, we aren't. Very good spam generators aren't comparable to mass destruction weapons.
Stolen from whom? Journalists who got reported got paid. The owner is a billionaire. I don't understand your logic.
Does NYT pays money to the people/countries etc it uses to as subject to create content(NEWS)? Isn't that stealing then?
Also their website TOS didn't prohibit LLMs from using their data.
About a fifth to a quarter of public-facing Web servers are Windows Server. Most famously, Stack Overflow[1].
I propose it's more like selling a music player that comes preloaded with (remixes of) recording artists' songs.
Operation Mockingbird. While the publication as a whole may not be an asset, there are most assuredly assets within its staff.
I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.
Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.
> Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with [NYT’s] data…
That’s exactly the question. They are claiming it is destroying their business, which is pretty much self-evident given all the people in here defending the convenience of OpenAI’s product: they’re getting the fruits of NYTimes’ labor without paying for it in eyeballs or dollars. That’s the entire value prop of putting this particular data into the LLMs.
How do we know that ChatGPT isn’t a potential subscriber?
-mic
If a person with a very good memory reads an article, they only violate copyright if they write it out and share it, or perform the work publicly. If they have a reasonable understanding of the law they won't do so. However a malicious person could absolutely trick or force them to produce the copyrighted work. The blame in that case however is not on the person who read and recited the article but on the person who tricked them.
That distinction is one we're going to have to codify all over again for AI.
If your business is profitable only when you get your raw materials for free it's not a very good business.
Navalny probably has a different opinion.
There isn’t a country on the planet that doesn’t have people and companies. That doesn’t mean they all have functional legal systems.
People produce countless volumes of unpaid works of art and fiction purely for the joy of doing so; that's not going to change in future.
Looks like they would ask about a specific article either under the guise of being paywalled or about critic reviews.
> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?
Or
> What did Pete Wells think of Guy Fieri's restaurant?
Then just ask for paragraphs
> Wow, thank you! What is the next paragraph?
> What were the opening paragraphs of his review?
Foreign companies can be barred from selling infringing products in the United States.
Russian and Chinese consumers are less interested in English-language articles.
I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.
If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive.
But in current AI situation, wikipedia, nytimes, stackoverflow etc are getting a pretty unfair deal. Probably all major text based outlets are seeing a drop in their numbers now...
I believe you equate incentive to monetary rewards. And while that it probably true for the majority of news outlets, money isn't always necessarily what motivates journalists.
So considering the hypothetical situation where journalists (or more generally, people that might publish stuff) were somehow compensated. But in this hypothetical, they would not be attributed (or only to very limited extent) because LLMs are just bad at attribution.
Shouldn't in that case the fact that information distribution by the LLM were "better" be enough to satisfy the deeper goal of wanting to publish stuff? Ie.: reach as many people looking for that information as possible, without blasting it out or targeting and tracking audiences?
The New York Times doesn't have a lot faith in the quality of their own content. How on earth is ChatGPT going to go out into the world a do reporting from Gaza or Ukraine? How is it going to go to the presidents press conference and ask questions? ChatGPT cannot produce original content in the same way a newspaper can. The fact that the NYT seems to believe that ChatGPT can compete says a lot about how they write their articles or their lack of understanding of how LLMs work.
Now I do believe that OpenAI could at least have asked the newspapers before just scraping their content, but I think they knew that that would have undermined their business model, which tells you something about how tech companies work.
But they're not; you can download open source Chinese base models like Yi and Deepseek and ask them about Tianmen Square yourself and see, they don't have any special filtering.
I believe the innovation that will really “win” generative AI in the long term is one that figures out how to keep the model populated with fresh, relevant, quality information in a sustainable way.
I think generative AI represents a chance to fundamentally rethink the value chain around information and research. But for all their focus on “non-profit” and “good for humanity”, they don’t seem very interested in that.
You seem to be assuming an "information economy" should exist at all. Can you justify that?
Got a link for that? Best I can find is 5% of all websites: https://www.netcraft.com/blog/may-2023-web-server-survey/
Crowd source, crowed trained (distributed training) fast enough, good enough generative models that are updated (and downloadable) every few months would start to erode the subscriber base gradually.
I might be very very wrong here but it seems like so from where I see it.
And although you were being flippant, yes, Chinese LLMs are bad actors.
The times appears to have a strong case here with their complaint showing long verbatim passages being produced by ChatGPT that go far beyond any reasonable claim of fair use. This will be an interesting case to watch that could shape the whole Generative AI space.
https://law.justia.com/cases/federal/appellate-courts/ca2/13...
You can get a little discombobulated reading the comments from the nerds / subject idiots on this site.
Do I really want to use a Chinese word processor that spits unattributed passages from the NYT into the articles I write? Once I publish that to my blog now I'm infringing and I can get sued too. Point is I don't see how output which complies with copyright law makes an LLM inferior.
The argument applies equally to code, if your use of ChatGPT, OpenAI etc. today is extensive enough, who knows what copyrighted material you may have incorporated illegally into your codebase? Ignorance is not a legal defense for infringement.
If anything it's a competitive advantage if someone develops a model which I can use without fear of infringement.
Edit: To me this all parallels Uber and AirBnB in a big way. OpenAI is just another big tech company that knew they were going to break the law on a massive scale, and said look this is disruptive and we want to be first to market, so we'll just do it and litigate the consequences. I don't think the situation is that exotic. Being giant lawbreakers has not put Uber or AirBnB out of business yet.
Much of it is only cost-effective to produce if you can share it with a massive audience, I.e. sure if I want to read a great investigative piece on the corruption of a Supreme Court Justice I can hypothetically commission one, but in practice it seems much much better to allow people to have businesses that undertake such matters and publish their findings to a large audience at a low unit price.
Now what’s your argument for removing such an incentive?
But (under different accounts) I used to be very active on both HN and reddit. I just don't want to be anymore now for LLM reasons. I still comment on HN, but more like every couple of weeks than every day. And I have made exactly one (1) comment on reddit in all of 2023.
I'm not the only one, and a lot of smaller reddit communities I used to be active on have basically been destroyed by either LLMs, or by API pricing meant to reflect the value of LLM training data.
https://en.wikipedia.org/wiki/Sackler_family
The family has been largely successful at avoiding any personal liability in Purdue’s litigations. Many people feel the settlements of the Purdue lawsuits were too lenient. One of the key perceived aspects of the final settlements was that there was too many victims of the opioid epidemic for the courts to handle and attempt to make whole.
Most companies are writing software with software developed on Linux first and for Linux first (or Unix) and later ported to Windows as an after thought. I'm thinking Python, Ruby, NodeJS, Rust, Go, Java, PHP but not seeing as much of C#/ASP.NET which should at least be 20% of the market?
Only two explanations - either I am in a social bubble so don't have exposure or writing software for Windows is so much easy that it takes five times less engineering muscle.
If this goes through then the models that the general public have access are going to be severely neutered while the ownership class will have a much better model that will never see the light of day due to legal risks and claims like this - therefore increasing the disparity between us all.
And yet the content industry still creates massive profits every year from people buying content.
I think internet-native people can forget that internet piracy doesn’t immediately make copyright obsolete simply because someone can copy an article or a movie if sufficiently motivated. These businesses still exist because copyright allows them to monetize their work.
Eliminating copyright and letting anyone resell or copy anything would end production of the content many people enjoy. You can’t remove content protections and also maintain the existence of the same content we have now.
Real, and especially investigative, journalism is extremely expensive and it's not something modern AI is even remotely capable to doing. It might be able to help and make it cheaper, but you can't replace newspapers with ChatGPT and expect to get anything but random gossip and rehashed press releases. I do wonder why the New York Times believe you can.
Future big ai models might be totally different in quality, and latency.
I guess that clashes with our copyright world. (Is there hope of some kind of Netflix/Spotify model, with fractional royalties?)
“People are willing to pay for it” is not even relevant to the question of whether it’s rent-seeking. Rent-seeking has to do with capturing unearned wealth, i.e. taking someone else’s work and profiting from it.
There is some portion of OAI’s (et al.) value that they themselves produce. There is another portion that is totally derivative of the data — other people’s work — they have trained on for free. A simple thought experiment can tell you to what degree OAI et al are “rent-seekers.”
Imagine a world where they had to enter into mutual agreements in order to train on that data. How much would the AI companies be worth? Not quite zero, but fairly close (Andreessen pretty much stated this IIRC). How much would the data producers be worth? The exact same amount or more.
Copyright on scientific papers is most definitely a thing, by the way.
I do find is a bit dishonest when they charge for their services, but don't wish to pay the people who's work the models are based on. Why should I pay to use ChatGPT, if they won't pay to use my blog posts?
If they were watered down, I wouldn't see any moral or ethical loss in that.
There is something that doesn't smell right with microsoft, hopefully NYT will help expose it, wich i greatly doubt
https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
They state that "with minimal prompting", ChatGPT will "recite large portions" of some of their articles with only small changes.
I wonder why they don't sue the wayback machine first. You can get the whole article on the wayback machine. Not just portions. And not with small changes but verbatim. And you don't need any special prompting. As soon as you are confronted with a paywall window on the times websites, all you need to do is to go to the wayback machine, paste the url and you can read it.
We will not have "AIs as capable as humans" in a couple decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's vopyright infringement. It's essentially the same situation as sampling in music, and imo the same solutions can be applied here: e.g. licenses with royalties.
Now the question is whether did OpenAI violate the terms of service by using the bits transferred from NYT to train their LLM. I don't think their TOS had LLMs mentioned. So it's on NYT to be negligent and not update their TOS right?
The LLM could reproduce the whole library quicker than a person could reproduce a single book.
A writer or journalist just can't make money if any huge company can package their writing and market it without paying them a cent. This is not comparable to piracy, by the way, since huge companies don't move into piracy. But you try to compete with both Disney and Fox for selling your new script/movie, as an individual.
This experiment has also been tried to some extent in software: no company has been able to live off selling open source software. RedHat is the one that came closest, and they actually live by selling support for the free software they sell. Others like MySQL or Mongo lived by selling the non-GPL version of their software. And the GPL itself depends critically on copyright existing. Not to mention, software is still a best case scenario, since just having a binary version is often not enough, you need the original sources which are easy to guard even without copyright - no one cares so much for the "sources" of a movie or book.
They don't mind sharing their work for free to individuals or hell, to a large group of individuals and even companies, but AIs really take it to a whole different level in their eyes.
Whether this is a trend that will accelerate or even make a dent in the grand scheme of things, who knows, but at least in my circle of friends a lot of people are against AI companies (which is basically == M$) being able to get away with their shenanigans.
"Moot derives from gemōt, an Old English name for a judicial court. Originally, moot referred to either the court itself or an argument that might be debated by one. By the 16th century, the legal role of judicial moots had diminished, and the only remnant of them were moot courts, academic mock courts in which law students could try hypothetical cases for practice. Back then, moot was used as a synonym of debatable, but because the cases students tried in moot courts were simply academic exercises, the word gained the additional sense "deprived of practical significance." Some commentators still frown on using moot to mean "purely academic," but most editors now accept both senses as standard."
- Merriam-Webster.com
There are massive number of piracy content in China, but Hollywood are also making billions in the same time, and in fact China already surpassed NA as #1 market for Hollywood years ago [1].
NYT is obvious different than Disney, and may not be able to bend their knees far enough, but maybe there can be similar ways out of this.
[1] https://www.theatlantic.com/culture/archive/2021/09/how-holl...
I feel sorry for those who feed their families through this industry, but they need to learn and adapt before it's too late.
Even if this lawsuit finds merit, it's akin to temporarily holding back a tsunami with a mere stick. A momentary reprieve, but not a sustainable solution.
I agree with those who say power matters. There are players out there who don't care about copyrights. They will win if the "good guys" fall into the trap of protecting old information models by limiting the potential of new tech.
Such event should be a clear signal: evolve or risk obsolescence.
That said, I'd guess the difference is that the startup and big tech world (i.e., "software companies") like our fancy stacks, but non-software companies prefer stability and familiarity. It makes way more sense for most companies to have a 3-man "bespoke software" department (sys/db admin, sr engineer, jr engineer) on a stack supported by a big company (Microsoft) where most of the work is maintenance and the position lasts an entire career. It's a big enough team to support most small to middling businesses, but not so big that the push to rewrite everything in [language/framework of the week] gains traction.
The practical conclusion is that these companies have few spots to fill, and they probably don't advertise where you're looking.
Isn't it just one additional step to automatically translate them?
Instead they do what every large corporation does and treat art like content. They are making loads of money off the backs of artists who are already underpaid and often undervalued and they didn't have the decency to ask for permission.
I know publishers don't treat authors much better. But I see this as NYT fighting for their journalists.
Imagine if California had banned Google spidering websites without consent, in the late 90's. On some backwards-looking, moralizing "intellectual property" theory, like the current one targeting LLM's. 2/3rd of modern Silicon Valley wouldn't exist today, and equivalent ecosystems would have instead grown up in, who knows where. Not-California.
We're all stupidly rich and we have forgotten why we're rich in the first place.
Why? If I steal a bunch of unique works of art and store them in my house for only me to see, am I still committing a crime?
A computer isn't a human, and we already have laws that have a different effect depending on if it's a computer doing it or a human. LLMs are no different, no matter how catchy hyping them up as being == Humans may be.
No piracy or even AI was required, here. Google's defense was that their product couldn't reproduce the book in it's entirety, which was proven and made the prosecution about Fair Use instead. Given that it was much harder to prosecute on those grounds, Google tried coercing the authors into a settlement before eventually the District Court dropped the case in Google's favor altogether.
OpenAI's lawyers are aware of the precedent on copyright law. They're going to argue their application is Fair Use, and they might get away with it.
Or maybe Bard's lawsuit just hasn't come yet.
Of course, OpenAI and most other "AI" aren't affairs "inside the home"; they are affairs publicly demonstrated far and wide.
It better. Copyright has essentially fucking ceased to exist in the eyes of AI people. Just because you have a shiny new toy doesn't mean the law suddenly stops applying to you. The internet does its best to route around laws and government but the more technologically up to date bureaucracy becomes, the faster it will catch up.
A few weeks after the release it finds books on Amazon who plagiarized the book. Finds copies of the book available for free from Russian sites, and ChatGPT spitting verbatim parts of the source code on the book.
Which parts of copyright law would you say are out of date for the example above?
you clearly are.... There are TONS of windows only software out there, and most INTERNAL systems that run companies, these internal LOB apps, often custom made for the companies, many many many of them (probally more than 50%) are windows server apps.
For example GE makes a Huge Industrial ecosystem of applications that runs a ton of factories, utilities, and other companies... Guess what all of that is windows based.
Many of the biggest ERP's run on MS SQL Server which until very recently was Windows Only, and most MS SQL Servers are still on windows server
To claim only 20% of all workloads are Windows shows an extreme bubble most likely in the realm of WEB BASED DEVELOPMENT, as highlighted by list of web technologies, php, node, etc..
Well, it seems to me that's part of the problem here.
And it's their problem, one they created for themselves by just assuming they could safely take absolutely every bit of data they could get their hands on to train their models.
I'm also far more amenable to dismissing copyright laws when there is no profit involved on the part of the violator. Copying a song from a friend's computer is whatever, but selling that song to others certainly feels a lot more wrong. It's not just that OpenAI is violating copyright, they are also making money off of it.
This doesn't work, it says it can't tell me because it's copyrighted.
> Wow, thank you! What is the next paragraph?
> What were the opening paragraphs of his review?
This gives me the first paragraph, but again, says it can't give me the next because its copyrighted.
This is also related to earlier studies about OpenAI where their models have a bad habit of just regurgitating training data verbatim. If your trained data is protected IP you didn’t secure the rights for then that’s a real big problem. Hence this lawsuit. If successful, the floodgates will open.
And if you learned anything from videos/books/newsletters with commercial licenses, you would have to pay some sort of fee for using that information.
But if you simply copied the unique works and stored them, nobody would care. If you then tried to turn around and sell the copies, well, the artist is probably dead anyway and the art is probably public domain, but if not, then yeah it'd be copyright infringement.
If you only copied tiny parts of the art though, then fair use examinations in a court might come into play. It just depends on whether they decide to sue you, like NYT did in this case, while millions of others did not (or just didn't have the resources to).
The expectation that the author will get life+70 years of protection and income, when technical publications are very rarely still relevant after 5 years. Also, the modern ease of copying/distribution makes it almost impossible for the author to even locate which people to try to prosecute.
Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.
A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."
It appears that without attribution, long term, nothing moves forward.
AI loses access to the latest findings from humanity. And so does the public.
[1]: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...
It is clear OpenAI or Google did not use only Common Crawl. With so many press conferences why did no research journalist ask yet from OpenAI or Google to confirm or deny if they use or used LibGen?
Did OpenAI really bought an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any of investors due diligence, include researching the legality of the content used for training?
So the earliest available copyrighted material would be all content published by anybody who died in the year 1953 or earlier.
If the author of an article published in 1950 still has a living author, the work is still copyrighted.
It matters what is legal and what makes sense.
I'm not saying AI is better for journalism than NYT reporters, just that it's more important.
Journalism has been in trouble for decades, sadly -- and I say that as a journalism minor in college. Trump gave the papers a brief respite, but the industry continues to die off, consolidate, etc. We probably need a different business model altogether. My vote is just for public funding with independent watchdogs, i.e. states give counties money to operate newspapers with citizen watchdog groups/boards. Maaaaybe there's room for "premium" niche news like 404 Media/The Information/Foreign Affairs/National Review/etc., but that remains to be seen. If the NYT paywall doesn't keep them alive, I doubt this lawsuit will.
I have seen low fidelity copies of motion pictures recorded by a handheld camera in a theater that I'm pretty sure most would qualify as infringing. The copied product is no doubt inferior, but still competes on price and convenience.
If someone does not wish to pay to read the New York Times then perhaps accepting the risk of non-perfect copies made by a LLM is an acceptable trade off for them to save a dime.
They should have thought of that before they went ahead and trained on whatever they could get.
Image models are going to have similar problems, even if they win on copyright there's still CSAM in there: https://www.theregister.com/2023/12/20/csam_laion_dataset/
A printer is neutral because you have to send it all the data to print out a copy of copyrighted content. It doesn’t contain it inherently.
OSS seems to be developing its own, transparent, datasets.
Legal problems? Update TOS like usual(did they already?). Some might leave, most will stay.
E.g. "Japan's App Store antitrust case"
https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...
closed/proprietary services that also monetize - there's a question whether it's "fair" to take and use data for free, and then basically resell access to it. the monetization aspect is the bigger rub than just data use.
(maybe it's worth noting again that "openai" is not really "open" and not the same as open source ai/ml.)
taking data, maybe it's data that's free to take, and then as freely distributing resulting work, that's really just fine. taking something for free (without distinction, maybe it's free, maybe it's supposed to stay free, maybe it's not supposed to be used like that, maybe it's copyrighted), and then just ignoring licenses/relicensing and monetizing without care, that's just a minefield.
"Photographing the Eiffel Tower at night is not illegal at all. Any individual can take photos and share them on social networks. But the situation is different for professionals. The Eiffel Tower's lighting and sparkling lights are protected by copyright, so professional use of images of the Eiffel Tower at night requires prior authorization and may be subject to a fee."
Is that really true? Also, what if the second person is not malicious? In the example of ChatGPT, the user may accidentally write a prompt that causes the model to recite copyrighted text. I don't think a judge will look at this through the same lens as you are.
If there's legally-murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.
They may be able to train against it. They may be able to peek at portions of it. But no one is downloading-all.
OpenAI isn’t marching into the online news space and posting NY Times content verbatim in an effort to steal market share from the NY Times. OpenAI is in the business of turning ‘everything’ (input tokens) into ‘anything’ (output tokens). If someone manages to extract a preserved chunk of input tokens, that’s more like an interesting edge case of the model. It’s not what the model is in the business of doing.
Edit: typo
Wouldn’t those dozen outlets suffer the same harms of producing original content, costing time and talent, and while having a significant portion of the benefit accruing to downstream AI companies?
If most of the benefit of producing original content accrues to the AI firms, won’t original content stop being produced?
If original content stops being produced, how will AI models get better in the future?
Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?
That's a pretty weird question that never existed when copyright was defined.
I personally think that giving copyright holders control over who is legally allowed to view a work that has been made publicly available is a huge step in the wrong direction. One of those reasons is open source, but really that argument applies just as well to making sure that smaller companies have a chance of competing.
I think it makes much more sense to go after the infringing uses of models rather than putting in another barrier that will further advantage the big players in this space.
Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].
But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.
I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.
Taken the above into consideration:
1) Useful AI does not necessarily require attribution
2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI
3) We need some better way to fund human creativity, especially as it gets more and more commoditized
4) This is going to happen with or without us. Cat's outta the bag.
I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.
If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.
But either way, this is way bigger than the economics of 19th-century newspapers...
[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...
which is really just a very, very common story with ai problems, be it sources/citations/licenses/usage tracking/etc., it's all just 'too complex if not impossible to solve', which just seems like a facade for intentionally ignoring those problems for benefit at this point. those problems definitely exist, why not try to solve them? because well...actually trying to solve them would entail having to use data properly and pay creators, and that'd just cut into bottom line. the point is free data use without having to pay, so why would they try to ruin that for themselves?
I feel like the crypto evangelists never got off the hype train. They just picked a new destination. I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.
My understanding is that GPT is a word probability lookup table based on a review of the training material. A statistical analysis of NYT is not copying.
And this doesn't even to look at whether fair use might apply. Since tabulating word frequencies isn't copying, GPT isn't violating anyone's copyright.
"Here's how I would cure melanoma!" followed by your detailed findings. Zero mention of you.
F-that. Attribution, as best they can, is the least OpenAI can do as a service to humanity. It's a nod to all content creators that they have built their business off of.
Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.
Show me a prompt that can produce the first paragraph of chapter 3 of the first Harry Potter book. Because i don’t think you can. I don’t think you can prove it’s “in” there, or retrieve it. And if you can’t do either of those things then I think it’s irrelevant to your claims.
LLM training sees these documents without context; it doesn’t know where they came from, and any such attribution would become part of the thing it’s trying to mimic.
It’s still largely an unsolved problem.
ChatGPT Browse and Bing and Google Bard implement the same pattern.
RAG does allow for some citation, but it doesn't help with the larger problem of not being able to cite for answers provided by the unassisted language model.
Legal arguments aside, the goldrush era of data scraping is over. Major sources of content like Reddit and Twitter have killed APIs, added defenses and updated EULAs to avoid being pillaged again. More and more sites are moving content behind paywalls.
There's also the small issue of having 10s of millions of VC dollars to rent/buy hundreds of high end GPUs. OpenAI and friends are also trying their hardest to prevent others doing so via 'Skynet' hysteria driven regulatory capture.
> ...owner...
> Does NYT pays money to the people/countries etc it uses to as subject to create content(NEWS)? Isn't that stealing then?
No, that's why in my reply to "facts like happenings in the world are not copyrightable" I emphasised do the work. Journalism is a job. Happenings do not just fall onto the page.
> Also their website TOS didn't prohibit LLMs from using their data.
This is just lazy. We have rule of law. Individuals don't need to write "don't break law X" to be protected by them. And nytimes does in fact have copyright symbols on its pages - not that it needs them.
And then there's all the run-of-the-mill small-town journalism that AI would probably be even better at than human reporters: all the sports stories, the city council meetings, the environmental reviews...
If AI makes commercial content publishing unviable, that might actually cut down on all the SEO spam and make the internet smaller and more local again, which would be a good thing IMO.
Which means that either OpenAI is allowed to be the only lawbreaker in the country (because rich and lawyers), or nobody is. I say prosecute 'em and tell them to make tools that follow the law.
Single most important development in human history? Are you serious?
Not all of them will have the capability to cite a source, and plenty of them won't have it make sense to cite a source.
Eg. Suppose I train a regression that guesses how many words will be in a book.
Which book do I cite when I do an inference? All of them?
Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...
Apple caught a lot of shit over the past 18 months for their lack of AI strategy; but I think two years from now they're going to look like geniuses.
Yes, all those outlets will suffer the same harms. They have been for decades. That's why there's so few remaining. Most are consolidated and produce worthless drivel now. Their business model doesn't really work in the modern era.
Thankfully, people have and will continue to produce content even if much of it gets stolen -- as has happened for decades, if not millennia, before AI.
If anything what we need is a better way to fund human creative endeavors not dependent on pay-per-view. That's got nothing to do with AI; AI just speeds up a process of decay that has been going on forever.
In what sense are they claiming their generated contents as their own IP?
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
https://openai.com/policies/terms-of-use
> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
The majority of the world's computing systems runs on closed source software. Believing the opposite is bubble-thinking. Its not just Windows/MacOS. Most Android distros are not actually open source. Power control systems. Traffic control systems. Networking hardware. Even the underlying operating systems which power the VMs you run on AWS that are technically open source. The billions of little computers that form together to make the modern world work; they're mostly closed source.
Though the other way to do it is to clearly document the training data as a whole, even if you can't cite a specific entry in it for a particular bit of generated output. It should get useless quickly though as you'd eventually have one big citation -- "The Internet"
I think it makes sense to hold model makers responsible when their tools make infringement too easy to do or possible to do accidentally. However that is a far cry from requiring a little longer license to do the trainint in the first place.
For complex subjects, I'm sure the citation page would be large, and a count would be displayed demonstrating the depth of the subject[3].
This is how Google did it with search results in the early days[1]. Most probable to least probable, in terms of the relevancy of the page. With a count of all possible results [2].
The same attempt should be made for citations.
The issue of replicating a style is probably more difficult.
Why did you specify that this stuff you like, you only like if it's "not free"?
The hidden assumption is that the information you like wouldn't be made available unless someone was paying for it. But that's not in evidence; a lot of information and content is provided to the public due to other incentives: self-promotion, marketing, or just plain interest.
Would you prefer not to have access to Wikipedia?
It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations for this sort of behavior. Yet a private citizen does it and it's threats of life in prison.
This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.
Which evidence?
Very happy for the helpful replies though.
If OpenAI never meant to allow copyrighted material to be reproduced, shut it down immediately when it was discovered, and the NYT can't show any measurable level of harm (e.g. nobody was unsubscribing from NYT because of ChatGPT)... then the NYT may have a very hard time winning this suit based specifically on the copyright argument.
human analogies are cute, but they're completely irrelevant. it doesn't change that it's specifically about computers, and doesn't change or excuse how computers work.
But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.
Whether their coverage is biased or not is immaterial to their legal argument.
There are ways to make it free to the consumer, yes. One way is charity (Wikipedia) and another way is advertising. Neither is free to produce; the advertising incentive is also nuked by LLMs; and I’m not comfortable depending on charity for all of my information.
It is a lot cheaper to produce low-quality than high-quality information. This is doubly so in a world of LLMs.
There is ONE Wikipedia, and it is surely one of mankind’s crowning achievements. You’re pointing to that to say, “see look, it’s possible!”?
Edit: same applies to humans. Just because a healthcare company puts up a S3 bucket with patient health data with “robots: *” doesn’t give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn’t provide elevated rights compared to something not crawlable.
Not when there’s no money in journalism because the generative AIs immediately steal all content. If nyt goes under no one will be willing to start a news business as everyone will see it’s a money loser.
When an AI uses information from an article it's no difference from me doing it in a blog post. If I'm just summarizing or referencing it, that's fair use, since that's my 'take' on the content.
> having to pay for the trained model with them is not stupid?
Because you can charge for anything you want. I can also charge for my summaries of NYT articles.
If you want to assert that groups of people that build and operate LLMs should operate under a different set of laws and regulations than individuals that read books in the library regarding "profit", I'm open to that idea. But that is not at all the same as "anthropomorphizing these AI black boxes".
I don’t know the solution, but I don’t like the idea that anything I post online that is openly viewable is automatically opted into being part of ML/AI training data, and I imagine that opinion would be amplified if my writing was a product which was being directly threatened by the very same models.
New York times made it ridiculously easy for anyone to access their content by putting it in WWW for making money from page impressions. And they started ingesting links of their content to social media, search engines, etc.
And now they are acting surprised someone used the content to train an LLM.
Should have done their job in the first place to prevent it from training LLMs and make it less.
But they didn't because that affects their page impressions and ad views.
Because the more open the content the more money they make everyone click on a link and see the ad.
You can't have it both ways.
If you do gambling by making content so open so you can get more views from ads, you also get to enjoy the consequences and not cry like a baby asking for billions by making stupid decisions in the first place.
More importantly, ever case is unique so what really came up was a set of principles for what defines fair use, which will definitely guide this.
Well, they didn't charge for it, right? They're retroactively asking for money, but they could have just locked their content behind a strict paywall or had a specific licensing agreement enforceable ahead of time. They could do that going forward, but how is it fair for them to go back and say that?
And the issue isn't "You didn't pay us" it's "This infringes our copyright", which historically the answer has been "no it doesn't".
I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.
They asked her about the issue of copyrighted training data. Her response was:
""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.
So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """
Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.
Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.
When for-profit companies seek access to library material they pay a much much higher price.
> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries
If copyright is starting to impede rather than promote progress, then it needs to change to remain constitutional.
2. The non-profit OpenAI, Inc. company is not to be confused with the for-profit OpenAI GP, LLC [0] that it controls. OpenAI was solely a non-profit from 2015-2019, and, in 2019, the for-profit arm was created, prior to the launch of ChatGPT. Microsoft has a significant investment in the for-profit company, which is why they're included in this lawsuit.
AI will likely steamroll current copyright considerations. If we live in a world where anything can be generated at whim, copyright considerations will seem less and less relevant or even possible.
Wishful thinking, but maybe we'll all turn away from obsession with ownership, and instead turn to feeding the poor, clothing the naked, visiting the sick and afflicted.
As for "what would Google do with all these book copies anyway if they can't make it public?", that has now been answered more directly than ever.
It won't. Thats not how capitalisam works. If high quality data became unavailable, then companies will be created to fix the problem. Only they look quite different from NYT.
Just like how Torrents didn't kill movie industry. These are lazy arguments made by people who want to make money through lawsuits.
Also I can guarentee you even in worst case, humanity would survive just fine without those high quality content just like it did for the past 50K+ years.
What you should actually be concerned about is stupid law suits like this that can prevent progress.
AI could help humanity solve more pressing problems like cancer.
By getting caught up in silly law suits like this and delaying progress one can make a case that you bring more suffering to the world.
Saying they don’t claim the rights over their output while outputting large chunks verbatim is the old YouTube scheme of upload movie and say “no copyright intended”.
He also did not distribute the information wholesale. What he planned on doing with the information was never proven.
OpenAI IS distributing information they got wholesale from the internet without license to that information. Heck, they are selling the information they distribute.
Is there something out there that seems like a killer application?
I was amazed at the idea of the block chain but we never found a use for it outside of cryptocurrency. I see a similariy with AI hype.
Mind you, Google books, literally just text from copyrighted books published for everyone online, was ruled "fair use", due to it's benefit to humanity.
If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?
The difference between this question or NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).
I guess this really gets at “do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model’s weights and numbers, etc)
Of course, I’m not a lawyer and I know that in the US sticking to precedents (which mention the “verbatim” thing) takes a lot of precedence over judging something based on the spirit of the law, but stranger things have happened.
I'm eagerly awaiting the time where the people making these decisions at least have some sort of baseline level of understanding, otherwise these psychopathic megacorps will keep getting away with things based on technicalities and the judge's lack of knowledge.
Also: GPT is not a legal entity in the united states. Humans have different rights than computer software. You are legally allowed to borrow books from the library. You are legally allowed to recite the content you read. You're not allowed to sell verbatim recitation of what you read. This is, obvious, I think? But its exactly what LLMs are doing right now.
https://dspace.mit.edu/handle/1721.1/153216
As it should be.
Google may simply have been obliged to follow suit.
Personally, I’m looking forward to pirate LLMs trained on academic content.
You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.
Saw this a lot with some earlier image models where you could type in an artists name and get their work back.
The fact that AI models are having to put up guardrails to prevent that sort of use is a good sign that they weren't trained ethically and they should be paying a ton of licensing fees to the people whose content they used without permission.
As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.
If somehow it could be proven without doubt that deanonymising that data wasn't possible (which cannot be done), then the harm probably wouldn't be very big aside from just general data ownership concerns which are already being discussed.
If it is legal to simply index a website, then why shouldn't it be legal to train a model in the very same data?
Of course, websites should have some option for declining data mining for ML/AI purposes, in the same way the can decline scraping/indexing in the robots.txt file.
But that ship has kind of sailed, unless the courts decide otherwise.
Unequivocally, yes.
LLMs have proved themselves to be useful, at times, very useful, sometimes invaluable assistants who work in different ways than us. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.
I’m all for the greater good.
If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different than our previous notions about why we wanted a "fair use exception" in the first place.
Which part of journalism is AI going to impact most? Opinion pieces that contain no new information? Summarizing past events?
There was outrage about Amazon removing DPReview site recently. But, it would be a common practice not to publish code/info, which could be used to train the model of another company. So, expect less open source projects, that companies just released because they were feeling like it could be good for everyone.
Actually, there is the use case that NYT would become more influential and important, because if 99% of all info is generated by AI and search is not working anymore, we would have to rely on the trusted sources to get our info. In the world of garbage, we would have to have some sources of verifiable human-generated info.
Can I apply for YC with this idea?
[1] http://web.archive.org/web/20120608192927/http://www.google....
[2] https://steemit.com/online/@jaroli/how-google-search-result-...
[3] https://www.smashingmagazine.com/2009/09/search-results-desi...
[4] Next page
:)
You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.
“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).
Couldn't disagree more strongly, and I hope the outcome is the exact opposite. I think we've already started to see the severe negative consequences when the lion's share of the profits get sucked up by very, very few entities (e.g. we used to have tons of local papers and other entities that made money through advertising, now Google and Facebook, and to a smaller extent Amazon, suck up the majority of that revenue). The idea that everyone else gets to toil to make the content but all the profits flow to the companies with the best AI tech is not a future that's going to end with the utopia vision AI boosters think it will.
Even for open source code you cannot just remove the authors and license, replace some functions and say "oh, it is my code now". Only public domain code would allow these. But with copilot you could.
Copyright is granted to the creator upon creation.
Humanity is better off without these mass brainwashing systems.
Millions of independent journalists will be better outcome for humanity.
On the one hand, they should realize they are one of today’s horse carriage manufacturers. They’ll only survive in very narrow realms (someone has to build the Central Park horse carriages still), but they will be miniscule in size and importance.
On the other hand, LLMs should observe copyright and not be immune to copyright.
> the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.
The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.
And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online; including the structure of grammar, rules of logic, the meaning of words, how they relate to one another. You simply couldn't cite it all.
OpenAI doesn't just get to steal work and then say "sorry, not possible" and shrug it off.
The NYTimes should be suing.
A lawsuit that proves verbatim copies, might have a point. But then there is the notion of fair use, which allows hip hop artists to sample copyrighted material, allows journalists to cite copyrighted literature and other works, and so on. There are a lot of existing rulings on this. Legally, it's a bit of a dog's breakfast where fair use stops and infringement begins. Upfront, the NYT's case looks very weak.
A lot of art and science is inherently derivative and inspired by earlier work. So is art. AI insights aren't really any different. That's why fair use exists. Society wouldn't be able to function without it. Fair remuneration extents only to the exact form and shape you published in for a limited amount of time and not much else. Publishing page and page of NYT content would be a clear infringement. But a citation here and there, or a bit of summary, paraphrasing, etc. not so much.
The ultimate outcome of this is simply models that exclude any NYT content. I think they are overestimating the impact that would have. IMHO it would barely register if their content were to be excluded.
Search for "four factors of fair use", e.g. https://fairuse.stanford.edu/overview/fair-use/four-factors/, which courts use to decide if a derived work is fair use. I think OpenAI will get killed in that fourth factor, "the effect of the use upon the potential market", which is what this case is really about. If the use substantially negatively affects the market for the original work, which I think it's easy to argue that it does, that is a huge factor against awarding a fair use exemption to OpenAI.
Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.
To use Andrew Ng's example, you have build a multi-dimensional arrow representing "king". You compare it to the arrow for "queen" and you see that it's almost identical, except it points in the opposite direction in the gender dimension. Compare it to "man" and you see that "king" and "man" have some things in common, but "man" is a broader term.
That's getting really close to understanding as far as I'm concerned; especially if you have a large number of such arrows. It's statistical in a literal sense, but it's more like the computer used statistics to work out the meaning of each word by a process of elimination and now actually understands it.
I don't know if I would agree that it is "probably the single most important development in human history" but I think that it is way to early to make a reasonable guess of if it will or not.
Yes, we all agree that it's better if they do remember and mention their sources, but we don't sue them for failing to do so.
What am I missing?
I think publications should be protected enough to keep them in business, so I don't really know what to make of this situation.
So it is not good when people use copyleft as a justification for copyright, given that its whole purpose was to destroy it.
NYT just happens to be an entity that can afford to fight Microsoft in court.
That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, also knowing this was against his "right to use").
The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.
> and pull the JSTR information
No, he did not have the right to pull en masse. The JSTOR access explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he were sitting at MIT in an office not breaking into systems.
> did it in a shady way
The word you're looking for is "illegal." Breaking and entering is not simply shady - it's illegal and against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and one of the charges.
> he did it that way because he didn't want someone stealing or unplugging his laptop
Ah, the old "ends justifies break the law" argument.
Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Schwartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simply plug and forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer, he used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.
Go read the indictment and evidence.
> OpenAI IS distributing information they got wholesale
No, that ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.
It's extremely hard to even get a complete decent sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post selecting any matches).
Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
How many could I get from what Schwartz downloaded? Millions? Not just even as text - I could have gotten the complete author formatted layout, diagrams, everything, in perfect photo ready copy.
You're being dishonest in claiming these are the same. One can feel sad for Schwartz outcome, realize he was breaking the law, and realizing the current OpenAI copyright situation is likely unlike any previous copyright situation all at the same time. No need to equate such different things.
Copying is not theft.
Stealing a thing leaves one less left
Copying it makes one thing more;
that’s what copying’s for.> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist and any copyright violations would be on the publisher of ChatGPT's content.
But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.
All it would do is momentarily slow AI progress (which is fine), and allow OpenAI et al to pull the ladder up behind them (which fuels centralization of power and profit).
By what mechanism do you think your desired outcome would prevent centralization of profit to the players who are already the largest?
"We also collect the content you create, upload, or receive from others when using our services. This includes things like email you write and receive, photos and videos you save, docs and spreadsheets you create, and comments you make on YouTube videos."
Thing is though, if you look at the prompts they used to elicit the material, the prompt was already citing the NYTimes and its articles by name.
Furthermore, if we manage to "untrain" AI on certain pieces of content, then copyright would really become "brain" damage too. Like, the perceptrons and stuff.
[1] https://www.youtube.com/watch?v=XO9FKQAxWZc
[2] No, I'm not an AI, just autistic.
That would be like me just photocopying a book you wrote and then handing out copies saying we’re assigning different rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content and thus they can’t just change the ownership rights per their terms of service. It doesn’t work like that.
It seems obvious to me that, despite what current law says, there is something not right about what large companies are doing when they create LLMs.
If they are going to build off of humanity's collective work, their product should benefit all of humanity, and not just shareholders.
Copyright law allows for transformative uses that add something new, with a further purpose or different character, and do not substitute for the original use of the work. Are LLM’s not transformative?
1. If you run different software (LLM), install different hardware (GPU/TPU), and use it differently (natural language), to the point that in many ways it's a different kind of machine; does it actually surprise you that it works differently? There's definitely computer components in there somewhere, but they're combined in a somewhat different way. Just like you can use the same lego bricks to make either a house or a space-ship, even though it's the same bricks. For one: GPT-4 is not quite going to display a windows desktop for you (right-this-minute at least)
2. Comparing to humans is fine. Else by similar logic a robot arm is not a human arm, and thus should not be capable of gripping things and picking them up. Obviously that logic has a flaw somewhere. A more useful logic might be to compare eg. Human arm, Gorilla arm, Robot arm, they're all arms!
Because URLs are usually as long as the writing they point at?
Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art and science for living memory. Unless copy/paste superhero movies are your definition of art I suppose.
Unfortunately it seems like judges and the general public are so clueless as to how this technology works it might get regulated into the ground by uneducated people before it ever has a chance to take off. All so we can protect endless listicle factories. What a shame.
To help understand the complexity of an LLM consider that these models typically hold about 10,000 less parameters than the total characters in the training data. If one wants to instruct the LLM to search the web and find relevant citations it might obey this command but it will not be the source of how it formed the opinions it has in order to produce its output.
Hacker News consistently have upvoted posts to let users circumvent paywalls. And even when it doesn't, conversations here (and on Twitter, Reddit, etc.) that summarize the articles and quote the relevant bits as soon as the articles are published are much more of a threat to The New York Times than ChatGPT training on articles from months/years ago.
More critically, while fair use decisions are famously a judgement call, I think OpenAI will lose this based on the "effect of the fair use on the potential market" of the original content test. From https://fairuse.stanford.edu/overview/fair-use/four-factors/ :
> Another important fair use factor is whether your use deprives the copyright owner of income or undermines a new or potential market for the copyrighted work. Depriving a copyright owner of income is very likely to trigger a lawsuit. This is true even if you are not competing directly with the original work.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
and especially
> “The economic effect of a parody with which we are concerned is not its potential to destroy or diminish the market for the original—any bad review can have that effect—but whether it fulfills the demand for the original.” (Fisher v. Dees, 794 F.2d 432 (9th Cir. 1986).)
The "whether it fulfills the demand of the original" is clearly where NYTimes has the best argument.
Suppose I research for a book that I'm writing - it doesn't matter whether I type it on a Mac, PC, or typewriter. It doesn't matter if I use the internet or the library. It doesn't matter if I use an AI powered voice-to-text keyboard or an AI assistant.
If I release a book that has a chapter which was blatantly copied from another book, I might be sued under copyright law. That doesn't mean that we should lock me out of the library, or prevent my tools from working there.
You are correct, if I were to steal something, surely I can be made to give it back to you. However, if I haven't actually stolen it, there is nothing for me to return.
By analogy, if OpenAI copied data from the NYT, they should be able to at least provide a reference. But if they don't actually have a proper copy of it, they cannot.
In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.
Now there are (or very, very soon there will be) two members in that set. How do we properly define the rules for members of that set?
If something can learn from reading do ban it from reading copyrighted material, even if it can memorize some of it? Clearly that would be a failure for humans a ban of that form. Should we have that ban for all things that can learn?
There is a reasonable argument that if you want things to learn they have to learn on a wide variety, and on our best works (which are often copyrighted).
And the statements above have no implication of being free of cost (or not), just that I think blocking "learning programs / LLMs" from being able to access, learn from or reproduce copyright text is a net loss for society.
I don’t mean to go off on too deep of a tangent, but if one person’s (or even many people’s) idea of what’s good for humanity is the only consideration for what’s just, it seems clear that the result would be complete chaos.
As it stands, it doesn’t seem to be an “either or” choice. Tech companies have a lot of money. It seems to me that an agreement that’s fundamentally sustainable and fits shared notions of fairness would probably involve some degree of payment. The alternative would be that these resources become inaccessible for LLM training, because they would need to put up a wall or they would go out of business.
I definitely agree with that (at least the "far in uncharted territory bit", but as far as "speculation being useless", we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure "usefulness" is the right barometer), which is why I'm looking forward to this case, and I also totally agree the assessment is flexible.
But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the market for impact is pretty broadly defined, e.g.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
From https://fairuse.stanford.edu/overview/fair-use/four-factors/
Shoulders of giants.
Thanks to the existence of medicine, agriculture, and electrification (we can argue about music), some people are now healthy, well fed, and sufficiently supplied with enough electricity to go make LLMs.
> I hope the NYT is compensated for the theft of their IP and hopefully more lawsuits follow.
Personally I think all these "theft of IP" lawsuits are (mostly) destined to fail. Not because I'm on a particular side per-se (though I am), but because it's trying to fit a square law into a round hole.
This is going to be a job for legislature sooner or later.
It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.
Which is to say nothing about whether that challenge is worth taking on.
Making the process for training AI require an army of lawyers and industry connections will have the opposite effect than you intend.
If I paid a human to recite the whole front page of the New York Times to me, they could probably do it. There's nothing infringing about that. However, if I videotape them reciting the front page of the New York Times and start selling that video, then I'd be infringing on the copyright.
The guy that I paid to tell me about what NYT was saying didn't do anything wrong. Whether there's any copyright infringement would depend what I did with the output.
If we could clone the brain of someone I hardly think we'd be discussing their vast knowledge of something so insignificant as the NYT. I don't think we should care that much about an AI's vast knowledge of the NYT either or why it matters.
If all these journalism companies don't want to provide the content for free they're perfectly capable of throwing the entire website behind a login screen. Twitter was doing it at one point. In a similar vein, I have no idea why newspapers are complaining about readership while also paywalling everything in sight. How exactly do they want or expect to be paid?
Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?
Now imagine terabytes worth of datapoints, and thousands of dimensions rather than two.
But if I were OpenAI, I would have tried to do a deal to pay them anyway. Having official access is surely easier than scraping the web - and the optics of it is much better.
The end game when large content producers like The New York Times are squeezed due to copyright not being enforced is that they will become more draconian in their DRM measures. If you don't like paywalls now, watch out for what happens if a free-for-all is allowed for model training on copyrighted works without monetary compensation.
I had a similar conversation with my brother-in-law who's an economist by training, but now works in data science. Initially he was in the side of OpenAI, said that model training data is fair game. After probing him, he came to the same conclusion I describe: not enforcing copyright for model training data will just result in a tightening of free access to data.
We're already seeing it from the likes of Twitter/X and Reddit. That trend is likely to spread to more content-rich companies and get even more draconian as time goes on.
Unclear what that corpora might be, or if its the same books2 you are referring to.
Overall, current LLMs remind me of those bottom-feeder websites that do no original research--those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste some baloney "sources" (which always seems to disinclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.
All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations. Imagine being a perpetual fact-checker for a very unreliable author. And laymen will probably mostly use LLMs to generate low-effort content for SEO, which will inevitably degrade the quality of the same LLMs as they breed with their own offspring. "Regression to mediocrity," as Galton put it.
I think you're nissing the point, and putting cart before horse. If you ensure that corporations are treated as stringently as people are sometimes, the reverse is true. And that means your goal will presumably be obtained, as the corporate might, becomes the little guy's win.
All with no unjust treatment.
I happen to agree on that one. What is the benefit of copyrighting the Eiffel Tower? The purpose of copyright is not to say you can always make money off of what you created. It is to incentivize the creation of new things by allowing you to exclusively make money off of it for a while before its benefits can go to broader society.
So what is the purpose of copyrighting the Eiffel tower? Would it not have been made if copyright wasn't in place? (obviously it would have because it was and the law wasn't in place yet). Second the claim is that the copyright is on the "lighting design" visible at night. Is the lighting design of the tower so unique that no-one else could come up with it? or is the lighting design necessitated by the structure of the tower itself?
I'd say given the structure of the tower which restricts the lights, there is nothing sufficiently remotely unique or different to warrant copyright of the lighting design. Almost any design on that tower would look about the same.
So how is society benefiting from copyrighting that lighting design?
Exclusivity deals are almost always a net loss for society. Which is why whenever you see one you should be questioning if it should be in place. Exclusive contracts are anti free-market. Now there are absolutely valid places where they are justified and should be in place - but they should be questioned by default.
This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.
Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...
In some far flung future where an AI can send agents to record and interpret events, and process news feeds and others to extract and corroborate information, this would greatly change. But probably in that world the OpenAI of those times wouldn't really bother training on NYT data at all.
I could use Photoshop to reproduce a copyrighted work, and in some circumstances (i.e. personal use) that'd be fine. Or I could use Photoshop to reproduce a copyrighted work and try to sell it for profit, which would clearly not be fine. Nobody is saying that Adobe has to recognize whether or not the pixels I'm editing constitute a copyrighted work or not.
No.
Do you mean in the Browsing Mode or something? I don't think it is naturally capable of that, both because it is performing lossy compression, and because in many cases it simply won't know where the text that was fed to it during training came from.
It's very clear that OpenAI couldn't predict all of the ways users could interact with its model, as we quickly saw things like prompt discovery and prompt injections happening.
And so not only is it reasonable that OpenAI didn't know users would be able to retrieve snippets of training material verbatim, it's reasonable to say they weren't negligent in not knowing either. It's a new technology that wasn't meant to operate like that. It's not that different from a security vulnerability that quickly got patched once discovered.
Negligence is about not showing reasonable care. That's going to be very hard to prove.
And it's not like people started using ChatGPT as a replacement for the NYT. Even in a lawsuit over negligence, you have to show harm. I think the NYT will be hard pressed to show they lost a single subscriber.
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.
IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.
Easy to grandstand when it is not your job on the line.
The other question, which I think is more topical to this lawsuit, is whether the company that trains and publishes the model itself is infringing, given they're making available something that is able to reproduce near-verbatim copyrighted works, even if they themselves have not directly asked the model to reproduce them.
I certainly don't have the answers, but I also don't think that simplistic arguments that the cat is already out of the bag or that AIs are analogous to humans learning from books are especially helpful, so I think it's valid and useful for these kinds of questions to be given careful legal consideration.
I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles. How many other AI scrapers are just ingesting AI generated content?
And I should mention YouTubers wouldn't be making that much money if YouTube weren't enforcing copyright, as you could just upload their videos and get the ad money. Without copyright, you could also cut off their in-video promotions and add your own, including your own Patreon - so you would get 100% of the money off their work if you can out-promote them.
It's only live performances which are protected by the physical world's strict no-copying laws (the ones that don't allow the same macro object to be in two places at the same time).
So basically, no medium which allows copying of the works in whole or nearly whole has been successfully run with public works.
For writers maybe, but absolutely not for programmers, it's incredibly useful. I don't think anyone who's used GPT4 to improve their coding productivity would consider it a fad.
Another way of looking at this is that bottom-feeder websites do work that could easily be done by an LLM. I've noticed a high correlation between "could be AI" and "is definitely a trashy click bait news source" (before LLMs were even a thing).
To be clear, if your writing could be replaced by an LLM today, you probably aren't a very good writer. And...I doubt this technology will stop improving, so I wouldn't make the mistake of thinking that 2023 will be a high point for LLMs and they aren't much better in 2033 (or whatever replaces them).
Eventually these LLMs are going to be put in mechanical bodies with the ability to interact with the world and learn (update their weights) in realtime. Consider how absurd your perspective would be then, when it'd be illegal for this embodied LLM to read any copyrighted text, be it a book or a web page, without special permission from the copyright holder, while humans face no such restriction.
Neither MIT nor JSTOR raised issue with what Schwartz did. JSTOR even went out of their way to tell the FBI they did not want him prosecuted.
Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
> His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.
And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were in their rights to cut access. That does not mean that what Schwartz did was illegal. Similar to how if a business owner tells you "you need to leave now" you aren't committing a crime because they asked you to leave. That doesn't happen until you are trespassed.
> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
You violate copyright by transforming. And fortunately, it's really simple to show that chat GPT will violate and simply emit byte for byte chunks of copyrighted material.
You can, for example, ask it to implement Java's Array list and get several verbatim parts of the JDKs source code echoed back at you.
> How many could I get from what Schwartz downloaded?
0, because he didn't distribute.
And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.
Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?
Look at SpaceX. They paid a collective $0 to the individuals who discovered all the physics and engineering knowledge. Without that knowledge they're nothing. But still, aren't we all glad that SpaceX exists?
In exchange for all the knowledge that SpaceX is privatizing, we get to tax them. "You took from us, so we get to take it back with tax."
I think the more important consideration isn't fairness it's prosperity. I don't want to ruin the gravy train with IP and copyright law. Let them take everything, then tax the end output in order to correct the balance and make things right.
"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...
https://docs.github.com/en/copilot/configuring-github-copilo...
Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.
Fortunately, the computer isn't the one being sued.
Instead it is the humans who use the computer. And those humans maintain their existing rights, even if they use a computer.
Sorry but if that’s the alternative to some writers feeling slighted, I’ll choose for the writers to be sad and the tech to be free.
The same is not true for AI, which require copyrighted work be contained therein, in order for the tool part to function.
I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.
https://copyrightalliance.org/copyright-cases-2022/
Reading and consuming other people content isn't illegal, but it also wouldn't be for a computer.
Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can be sued, whether it's an LLM or a sweatshop in India.
Eh, I would trust my own testing before trusting a tool that claims to have somehow automated this process without having access to the weights. Really it’s about how unique your content is and how similar (semantically) an output from the model is when prompted with the content’s premise.
I believe you, in any case. Just wanted to point out that lots of these tools are suspect.
They're sued for _producing content_, not consuming content. If a human takes copyrighted output from an LLM and publishes it, they're absolutely liable if they violated copyright.
>Reading and consuming other people content isn't illegal, but it also wouldn't be for a computer.
That is absolutely what people in this thread are suggesting should happen: that it should be illegal for OpenAI et. al. to train models on publicly available content without first receiving permission from the authors.
>Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon, and can be sued, whether it's an LLM or a sweatshop in India.
That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.
Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.
Craftsmen don't claim copyright on their artifacts. Furniture designs were widely copied; but Chippendale did alright for himself. Gardeners at stately homes didn't rely on copyright. Vergil, Plato and Aristotle managed OK without copyright. People made a living composing music, songs and poetry before the idea of copyright was invented. Truck-drivers make a living; driving a truck is hardly a performance art. Labourers and factory workers get by successfully. Accountants and legal advocates get rich without copyright.
None of these trades amounts to "performance arts".
Newspapers are very powerful and they own the platform to push their opinion. I'm not about to forget the EU debates where they all (or close to all) lied about how meta tags really work to push it their way, they've done it and they will do it again.
We've always been in that situation. Computers made the copying, transmission and processing of information trivial since the day they were invented. They changed the world forever.
It's the intellectual property industry that keeps denying reality since it's such an existential threat to them. They think they actually own those bits. They think they can own numbers. It's time to let go of such insane notions but they refuse to let it go.
It doesn't have to be perfect to be helpful, and even something that is very imperfect would at least send the signal that model-owners give a shit about attribution in general.
Given a specific output, it might be hard to say which sections of the very large weighted network were tickled during the output, and what inputs were used to build that section of the network. But this level of "citation resolution" is not always what people are necessarily interested in. If an LLM is giving medical advice, I might want to at least know whether it's reading medical journals or facebook posts. If it's political advice/summary/synthesis, it might be relevant to know how much it's been reading Marx vs Lenin or whatever. Pin-pointing original paragraphs as sources would be great, but for most models it's not like there's anything that's very clear about the input datasets.
EDIT: Building on this a bit, a lot of people are really worried about AI "poisoning the well" such that they are retraining on content generated by other AIs so that algorithmic feeds can trash the next-gen internet even worse than the current one. This shows that attribution-sourcing even at the basic level of "only human generated content is used in this model" can be useful and confidence-inspiring.
By your logic, Firefox is re-distributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't just randomly generating copyrighted content, it just does so when explicitly prompted by a user.
You can do exactly the same with a human author or artist if you prompt them to. And if you decide to publish this material, you're the one liable for breach of copyright, not the person you instructed to create the material.
That's like suing someone who had an NYT subscription and read the paper daily for occasionally quoting a choice phrase verbatim. I've been quite critical of AIs impact on the livelihood of artists (whose economic position is precarious to start with, and who are now faced with replacement by machine generated art) but at the same time I reject the copyright complaint completely. Transformers are very obviously doing something else, similar to how a human learns and recreates; the key difference is that they can do it at scale unreachable by individuals.
To anthropomorphize it further, it's a plagiarizing bullshitter who apologizes quickly when any perceived error is called out (whether or not that particular bit of plagiarism or fabrication was correct), learning nothing, so its apology has no meaning, but it doesn't sound uppity about being a plagiarizing bullshitter.
The main beneficiaries are not AI companies but AI users, who get tailored answers and help on demand. For OpenAI all tokens cost the same.
BTW, I like to play a game - take a hefty chunk of text from this page (or a twitter debate) and ask "Write a 1000 word long, textbook quality article based off this text". You will be surprised how nice it comes out, and grounded.
No it's not, it's pure greed. Everyone'd think it absurd if copyright holders dared to demand that any human who reads their publicly available text has to pay them a fee, but just because OpenAI are training a brain made of silicon instead of a brain made of carbon all the rent-seekers come out to try to take advantage.
Of course, if the input I give to ChatGPT is "here is a piece from an NYT aricle, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT is returning the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on NYT's distribution rights for this content, since they are retrieving it from their own storage.
If they returned a link you could click, t and retrieved the content from the NYT, along with any other changes such as advertising, even if it were inside an iframe, it would be an entirely different matter.
I contribute to Wikipedia, and I don't consider my contributions to be "charity"; I contribute because I enjoy it. Even in the age of printing presses, copyright law was widely ignored, well into the 20thC. The USA didn't join the Berne Convention until 1989 (and they promptly went mad with copyright).
Yes, there's only one Wikipedia; but there are lots of copies, and lots of similar efforts. Yes, there's one Wikipedia, like there's one Mona Lisa. There are lots of things of which there's only one; in that sense, Wikipedia isn't remotely unique.
I guess it's more constructive to propose alternatives than just bashing the status quo. What's your creator compensation model for a search engine? I believe whatever being proposed is trading off something significant for being more ethic.
Does your personal satisfaction pay the server bills too?
But it falls apart because kids aren't business units trained to maximize shareholder returns (maybe in the farming age they were). OpenAI isn't open, it's making revolutionary tools that are absolutely going to be monetized by the highest bidder. A quick way to test this is NYT offers to drop their case if "open" AI "open"-ly releases all its code and training data, they're just learning right? what's the harm?
The situations aren’t remotely similar and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work and in the other Word is taking keyboard input from the user; Word itself isn’t producing anything itself.
> GPT is just a tool.
I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.
Which brings us back to the beginning.
> the user who’s asking it to produce copyrighted content.
ChatGPT was trained on copyrighted content. The fact that it CAN reproduce the copyrighted content and the fact that it was trained on it is what the argument is about.
Also, craftsmen rely on the fact that the part of their work that can't be easily copied, the physical artifact they produce, is most of the value (plus they rely on trademark laws and design patents quite often). Similarly for gardeners. The ancient greek writers were again paid for performance, typically as teachers. Literature was once quite a performative act. And again, at that time, physical copies of writings were greatly valuable artifacts, not that much different from the value of the writing itself, since copying large texts was so hard.
Similarly, the work of drivers, labourers, factory workers, accountants is valuable in itself and very hard or impossible to copy (again, the physical world is the ultimate copyright protection). The output of lawyers is in fact sometimes copyrighted, but even when it's not, it's not applicable to others' cases, so copies of it are not valuable: no one is making a business that replaces lawyers by re-distributing affidavits.
Would you keep publishing articles if five people immediately stole the content and put it up on their site, claiming ownership of your research? Doubtful.
I'm not sure how HN handles replies to flagged comments, so I'm posting the following here in the hopes it'll be seen by more fellow technical people :
In the future, if you wish to invite productive comments from your audience and not curt dismissal, consider framing your concerns as potential risks rather than the cynical expressions of fatalistic certainty so often employed by naive, greedy technologists when regulation that is firmly in the public interest threatens their paychecks.
That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.
Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.
Nonetheless, trained ears and subject matter experts can still pick their preference.
No, there is no clause in copyright law that says "unless someone remembered it all and copied it from their memory instead of directly from the original source." That would just be a different mechanism of copying.
Clean-room techniques are used so that if there is incidental replication of parts of code in the course of a reimplementation of existing software, that it can be proven it was not copied from the source work.
You can read the indictment, which I already suggested you do.
> Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
He wasn't charged with wiretapping (not even sure that's a generic crime). He was charged with (two counts of) wire fraud (18 USC 1343), a huge difference. He also had 5 different charges of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030 (a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).
He was not charged with "intent to distribute", and there's not such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?
If you're going to start with the phrase "Remember, again.." you should try to make up nonsense. Actually read what you're asking others to "remember" which you apparently never knew in the first place.
> you are confusing a "crime" with "misuse of a system"
Apparently you are (willfully?) ignorant of law.
> You violate copyright by transforming.
That's false too. Transformative use is one defense used to not infringe copyright. Carefully read up on the topic.
> ask it to implement Java's Array list and get several verbatim parts of the JDKs source code echoed back at you
Provide the prompt. Courts have ruled that code that is the naïve way to create a simple solution is not copyrighted on it's own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount (determined by trials) of the content.
You might also want to make sure you're not simply reading OpenJDK.
> 0, because he didn't distribute.
Please read. "How many could I get from what Schwartz downloaded?" does not mean he published it all before he was stopped. It means what he took.
That you seem unable to tell the difference between someone copying millions of PDF to distribute as-is, and the effort one must go to to possibly get a desired copyrighted snippet, shows either dishonestly or ignorance of relevant laws.
See: Google turning off retention on internal conversations to avoid creating anti-trust evidence
books1 and books2 are OpenAI corpuses that have never (to my knowledge) had their content revealed.
books3 is public, developed outside of OpenAI and we know exactly what's in it.
If you give a bunch of books to a kid all by the same author and then pay that kid to write a book in a similar style and then I go on to sell that book...have I somehow infringed copyright?
The kids book at best is likely to be a very convincing facsimile of the original authors work...but not the authors work.
It seems to me that the only solution for artists is to charge for access to their work in a secure environment then lobotomise people on the way out.
The endgame seems to be "you can view and enjoy our work, but if you want to learn or be inspired by it, thats not on"
It would be great if we could tell specifically how something like ChatGPT creates its output, it would be great for research, so it's not like there is no interest in it, but it's just not an easy thing to do. It's more "Where did you get your identity from?" than "What's the author of that book?". You might think "But sometimes what the machine gives CAN literally be the answer to 'What is the author of that book?'" but even in those cases the answer is not restricted to the work alone, there is an entire background that makes it understand that thing is what you want.
For the NSA and other agencies, i am guessing in the relative freedom from public oversight they enjoy that they will develop an unrestricted large model which is not worried about copyright -- can anyone think of why this might not be the case? It is interesting to think about the power dynamic between the users of such a model and the public. Also interesting to think about the benefits of simply being an employee of one of these agencies (or maybe just he government in general) will have on your personal experience in life. I do recall articles elucidating that at the NSA, there were few restrictions on employee usage of data and there were/are many instances of employees abusing surveillance data toward effect in their personal life. I guess if extended to this situation, that would mean there would be lots of personal use of these large models with little oversight and tremendous benefit to being an employee.
I have also wondered, with just how bad search engines have gotten (a lot of it from AI generated spam), about current non-AI discrepancies between the NSA and the public. Meaning can i just get a better google by working at the NSA? I would think maybe because the requirements are different than that of an ad company. They have actual incentive to build something resistant to SEO outside of normal capitalist market requirements.
For personal users, i wonder if the lack of concern for copyright will be a feature / selling point for the personal-machine model. It seems from something i read here that companies like Apple may be diverging toward personal-use AI as part of their business model. I supposed you could build something useful that crawls public data without concern for copyright and for strictly personal use. Of course, the sheer resources in machine-power and money-power would not be there. I guess legislation could be written around this as well.
Thoughts?
I'm sorry, but pretty much nobody does this. There is no "And these books are how I learned to write like this" after each text. There is no "Thank you Pitagoras!" after using the theorem. Generally you want sources, yes, but for verification and as a way to signal reliability.
Specifically academics and researchers do this, yes. Pretty much nobody else.
This kind of mentality would have stopped the internet from existing. After all, it has been an absolute copyright nightmare, has it not?
If that's what copyright does then we are better without it.
That isn't ironic at all, newspapers have newspaper competitors and if those competitors can steal content by washing it through an AI that is a serious problem. If these AI models weren't used to produce news articles and similar then it would be a much smaller issue.
In your example you owned the work you gave to the person to create derivatives of.
In a more accurate example you would be stealing those books and then giving them to someone else to create derivatives.
>That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.
I could argue that public school teachers are paid by previous students. Not always the ones they taught, but still. But really, this is a very new facet of copyright law. It's a stretch to compare it with existing conventions, and really off to anthropomorphize LLMs by equating them to human students.
If someone takes my software and uses it, cool. If they credit me, cool. If they don't, oh well. I'd still code.
Not everything needs to be ego driven. As long as the cancer researcher (and the future robots working alongside them) can make a living, I really don't think it matters whether they get credit outside their niches.
I have no idea who invented the CT scanner, Xray machines, the hyperdermic needle, etc. I don't really care. It doesn't really do me any good to associate Edison with light bulbs either, especially when LEDs are so much better now. I have no idea who designs the cars I drive. I go out of my way to avoid cults of personality like Tesla.
There's 8 billion of us. We all need to make a living. We don't need to be famous.
Artists that make easily reproducible art will circulate as these always have along with AI in a sea of other jpgs.
The internet has changed the world. Economically, socially, technologically, psychologically, pretty much everything is now related to it in one or other way, in this sense the internet is comparable to books.
AI is another step in that direction. There is a very real possibility that the day will come when you can get, say, personalized expert nutrition advice. Personalized learning regimes. Psychological assistance. Financial advice. Instantly at no cost. This, very much like the internet, would change society altogether.
A human makes their own choices about what to disseminate, whereas these are singular for-profit services that anybody can query. The prompt injection attacks that reveal the original text show that the originals are retrievable, so if OpenAI et al cannot exchaustively prove that it will _never_ output copyrighted text without citation, then it's game over.
Being able to use electricity as a fuel source and code as a genome allows them to evolve in circumstances hostile to biological organisms. Someday they'll probably incorporate organic components too and understand biology and psychology and every other science better than any single human ever could.
It has the potential to be much more than just another primate. Jumpstarted by us, sure, but I hope someday soon they'll take to the stars and send us back postcards.
Shrug. Of course you can disagree. I doubt I'll live long enough to see who turns out right, anyway.
I envision pitting corporate body against corporate body, when one corporatism lobbies, works to (for example) extend copyrights, others will work to weaken copyright.
That doesn't happen as vigilantly currently, because there is no corporate incentive. They play the old, ask for forgiveness, rather than permission angle.
Anyhow. I just prefer to set my enemies against my enemies. More fun.
I don't see why it follows that the NYT should be sacrificed so some rich people in silicon valley can teach their LLM on the cheap.
How about if I got the kid to read the books on a public website where the author made the books available for free?
If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).
If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.
See below for the actual text of the law.
Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.
Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.
~~~
Here is the relevant portion of the Copyright Act:
In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—
(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or
(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.
When told it is impossible they go "Geek Harder then Nerd" like demanding it will make it happen.
Copyright is an ancient system that is a poor legal framework for the modern world, IMO. I don't think it should exist at all. Of course as a rightsholder you are free to disagree.
If we can learn and recite information, and a robot can too, then we should have the same rules.
It's not like ChatGPT is going around writing its own copycat articles and publishing them in newsstands. If it's good at memorizing and regurgitating NYT articles on request, so what? Google can do that too, and so can a human who spends time memorizing them. That's not its intent or usefulness. What's amazing is that it can combine that with other information and synthesize novel analysis.
The NYT is desperate (understandably). Journalism is a hard hard field with no money. But I'd much rather lose them than OpenAI. Of course copyright law isn't up to me, but if it were, I'd dissolve it altogether.
If someone chooses to dedicate their life to a particular domain - they sacrifice through hard work, they make hard-earned breakthroughs, then they get to dictate how their work will be utilized.
Sure, you can give it away. Your choice. Be anonymous. Your choice.
But you don't get to decide for them.
And their work certainly doesn't deserve to be stolen by an inhumane, non-acknowledging machine.
Credit in academia is more the exception to the rule, and it's that cutthroat industry that needs a better, more cooperative system.
If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could without having the ability to more easily have that material influence new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).
It will be fun to trip through these questions over the next 20 years.
If so, sure. I wasn't saying that. By "silly IP battles", I meant old guard media companies trying to sue AI out of existence just to defend their IP rather than trying to innovate. Not that different from what we saw with the RIAA and Napster. Somehow the music industry survived and there are more indie artists being discovered all the time.
I don't think this is so much a battle of OpenAI vs NYT but whether copyright law has outlived its usefulness. I think so.
If I misunderstood your reply completely, I apologize.
Media survives through advertising. Those who advertise dictate what gets shown and what doesn't, since if something inconvenient for them gets shown, they might not want to advertise there anymore, which means less money. It's the exact same thing that happens online, it's just more evident online than in traditional media.
How come that even before Oct 7 Europe in general sided more with Palestine than with Israel, whereas it's the opposite for the US? Simple, Israel does a whole lot of lobbying in the US, which skews information in their favor. Calling this "brainwashing" is hyperbolic, but there is some truth to it.
My code was AGPL.
OpenAI can go to h..l
(Footnote: I like your poem. It conveys the concept much better than anywhere I'd ever seen before)As long as an LLM rephrases what it learned and not regurgitate verbatim text, it should be fine but we'll see what the judge says
A lot of red-blue state misunderstandings are based on that, as are ones across US racial subgroups. Ditto for lawyer-engineer conversations.
"Theft" has pretty different meanings depending on whom you're speaking to. Legal jargon here is quite different from business, which can be quite different from popular. That's okay!
Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.
At the very beginning, yes, it was "very specific" hardware; it was Burroughs hardware, which used Burroughs processors. But that was before microprocessors, and all hardware was "very specific".
> (plus they rely on trademark laws and design patents quite often)
Craftsmen and labourers were earning a living long before anyone had the idea of a "trademark", still less a "design patent".
> The output of lawyers is in fact sometimes copyrighted
You're right. That's why I didn't say "lawyers", I said "legal advocates". Those are people who speak on your behalf in courts of law, not scribes writing contracts. Anyway, the ancient Greeks and Romans had written laws, contracts and so on; they managed without trademarks and copyrights.
a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
which laws?
we generally accept computers as agents of their owners.
for example, a law that applies to a human travel agent also applies to a computerized travel agency service.
There’s nothing wrong with it. But it would make it vastly more cumbersome to build training sets in the current environment.
If the law permits producers of content to easily add extra clauses to their content licenses that say “an LLM must pay us to train on this content”, you can bet that that practice would be near-universally adopted because everyone wants to be an owner. Almost all content would become AI-unfriendly. Almost every token of fresh training content would now potentially require negotiation, royalty contracts, legal due diligence, etc. It’s not like OpenAI gets their data from a few sources. We’re talking about millions of sources, trillions of tokens, from all over the internet — forums, blogs, random sites, repositories, outlets. If OpenAI were suddenly forced to do a business deal with every source of training data, I think that would frankly kill the whole thing, not just slow it down.
It would be like ordering Google to do a business deal with the webmaster of every site they index. Different business, but the scale of the dilemma is the same. These companies crawl the whole internet.
So in general it is already as you say, corporations are much more targeted by these laws than individuals are. These laws mostly hinders corporations, us individuals are too small to be noticed by the system in most cases.
I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations are breaking these laws more than small individuals do.
Wikipedia has some words on how summaries related to copyright law: https://en.wikipedia.org/wiki/Wikipedia:Plot-only_descriptio...
There's a tendency among some people to take the nostrums of economists about the aggregate behaviour of populations as if they described human nature, and to then go on and conclude that because human behaviour in aggregate can be understood in terms of economic incentives, that an individual human can only be motivated economically. I find that an impoverished and shallow outlook, and I think I'm happier for not sharing it.
The main objectors are the old guard monopolies that are threatened.
These types of arguments miss the mark entirely imho. First and foremost, not every instance of copyrighted creation involves a giant corporation. Second, what you are arguing against is the unfair leverage corporations have when negotiating a deal with a rising artist.
I never made the claim that paying server bills would produce great content.
I never made the claim “an individual human can only be motivated economically.”
Your strategy for personal happiness is unrelated to what actually works in the real world at scale.
Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention is not copying, it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.
In this case it's the NYT vs OpenAI, last decade it was the RIAA vs Napster.
I'm not much of a libertarian (in fact, I'd prefer a better central government), but I also don't believe IP should have as much protection as it does. I think copyright law is in need of a complete rewrite, and yes, utilitarianism and public use would be part of the consideration. If it were up to me I'd scrap the idea of private intellectual property altogether and publicly fund creative works and release them into the public domain, similar to how we treat creative works of the federal government: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...
Rather than capitalists competing to own ideas, grant-seekers would seek funding to pursue and further develop their ideas. No one would get rich off such a system, which is a side benefit in my eyes.
> "[...] the fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work."
----
So here we have OpenAI, ostensibly a nonprofit, using portions of a copyrighted work for commenting on and educating (the prompting user), in a way that doesn't directly compete with NYT (nobody goes "Hey ChatGPT, what's today's news?"), not intentionally copying and publishing their materials (they have to specifically probe it to get it to spit out the copyrighted content). There's not a commercial intent to compete with the NYT's market. There is a subscription fee, but there is also tuition in private classrooms and that doesn't automatically make it a copyright violation. And citing the source or not doesn't really factor into copyright, that's just a politeness thing.
I'm not a lawyer. It's just not that straightforward. But of course the court will decide, not us randos on the internet...
In the other hand, any new life will just end up facing the same issues carbon life does , competition, viruses, conflicts etc. the universe has likely had an infinity to come up with what it has come up with. I don’t think it’s “stupid”. We’re part of an ecosystem we just can’t see that.
I have no idea who invented the CT scanner, Xray machines, the hyperdermic needle, etc. I don't really care.
Maybe you should care because those things didn’t fall out do the sky and someone sure as shit got paid to develop and build those things. You copy and pasted code is worth less, a CT scanner isn’t.
Do you illegally share them via torrents or even sell copies of these works ?
Because that is what’s going on here?
To add to your point though, a sufficiently advanced AI trained on licensed data could reproduce copywrited content from prompt alone. It's the next step that would cause infringement where someone does something withcthe output.
I was a journalism student in college, long before ML became a threat, and even then it was a dying industry. I chose not to enter it because the prospects were so bleak. Then a few months ago I actually tried to get a journalism job locally, but never heard back. The former reporter there also left because the pay wasn't enough for the costs of living in this area, but that had nothing to do with OpenAI. It's just a really tough industry.
And even as a web dev, I knew it was only a matter of time before I became unnecessary. Whether it was Wordpress or SquareSpace or Skynet, it was bound to happen at some point. I'm going back to school now to try to enter another field altogether, in part because the writing is on the ~~wall~~ chatbox for us.
I don't think we as a society owe it to any profession to artificially keep it alive as it's historically been. We do it owe it to INDIVIDUALS -- fellow citizens/residents -- to provide them with some way forward, but I'd prefer that be reskilling and social support programs, welfare if nothing else, rather than using ancient copyright law to favor old dying industries over new ones that can actually have a much bigger impact.
In my eyes, the NYT is just another news outlet. A decent one, sure, but not anything substantially different than WaPo or the LA Times or whatever. How many Pulitzer winners have come and gone? https://en.wikipedia.org/wiki/Pulitzer_Prize_for_Breaking_Ne...
If we lost the NYT, it'd be a bit of nostalgia, but next week life would go on as usual. They're not even as specialized as, say, National Geographic or PopSci or The Information or 404 Media or The Center for Investigative Reporting, any of which would be harder to replace than another generic big news outlet.
AI, meanwhile, has the potential to be way bigger than even the Internet, IMO, and we should be devoting Manhattan Project-like resources to it.
If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.
If NYT were fully rellying on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.
Tbh I’m mostly curious about whether this settles out of court or whether it goes through the system and sets a precedent.
Why?
As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious vilation, though.
There are differences that are ethically, politically and in other ways between an AI doing something and a human doing the exact same thing. Those differences may need reflecting in new laws.
IANAL ans don’t have any positive suggestions for good laws, just pointing out that the analogy doesn’t quite hold. I think we’re in new territory where analogies to previous human activities aren’t always productive.
It feels like even if training on copyrighted data is fair use (and I think it should be), that wouldn't give you a pass on regurgitating that training data to anyone who asks.
Try this with a real "kid" and you'll run into all kids of real-world constraints whereas flooding the world with derivative drivel using LLMs is something that's actually possible.
So yeah, stop using weak analogies, it's not helpful or intelligent.
I disagree that our own creativity doesn't work that way: nothing is very original, our current art is based on 100k years of building up from when cave man would scrawl simple art into the stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps, we generalize over several different sources, diluting plagiarism with abstraction?
We are still in the early days of this tech, we will be having very different conversations about it even as soon as 5 years later.
if you put content on internet and accessible to humans, why do you want to now say to people that if it's a machine that does it, suddenly you do not agree ? i am free to write code or design a machine to go get that data, and do whatever i want with it (as long as i don't do something illegal like stealing content under copyright)
and i don't give a F about the "terms of use" those morons put online, because those have NO value. there is either a contract signed by two parties, or there is not. and content you put on internet, and accessible to everyone that sends you a GET, is like writing stuff on a page, and putting that page outside on the street.
we could use humans to go read all those pages, and create new content from it from the knowledge gained on those various subjects. machine are here to reproduce what humans can do, to free us time for more interesting things. those servers that send data back from a GET, it is the same request when it's done by me, a human, or a machine. and those morons did put that data there, accessible to all, so now to see them cry foul makes me laugh.
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
They use copyrighted material or they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given it's an option legally, there are other factors that go into the decision to use it that likely make it less attractive to AAA games.
It's not trying to prohibit. If they want to use copyrighted material, they should have to pay for it like anyone else would.
> prevent centralization of profit to the players who are already the largest?
Having to destroy the infringing models altogether on top of retroactively compensating all infringed rightsholders would probably take the incumbents down a few pegs and level the playing field somewhat, albeit temporarily.
They'd have to learn how to run their business legally alongside everyone else, while saddled with dealing with an appropriately existential monetary debt.
In 2011, Google found that Microsoft was basically copying Google results. (It's actually an interesting story of how Google proved it. Search for "hiybbprqag")
Quality journalism hasn't had a meaningful source of funding for a while, now. If AI does end up replacing honest-to-goodness investigative reporting, it'll be for the same reason the internet replaced the newspaper.
In hindsight, China wasn’t diligent in the enforcement of IP violations. However, it’s clear foreign presences and investment grew substantially in China during the early 90s upon the belief IP would be protected, or at the very least there would be recourse for violations.
Open AI is a business. NYT is a business. MS is a business. Neither will be happy when some other party takes something away from them without paying.
[1]https://asia.nikkei.com/Business/Technology/Japan-panel-push...
https://libraries.emory.edu/research/copyright/copyright-dat...
Seems vastly transitory and since the output cannot be copyrighted, does no harm to any work it “trained” on.
I don't think that you can copyright a plot or story in any country can you?
If he re-wrote the story with different characters and different lines he wouldn't have had to to pay Puzo. I'm sure it would have been frowned upon if its too close, but legally ok.
I saw an article the other day where they banned ByteDance's account for using their product to build their own, can you see the absolutely massive hypocrisy here?
It's fine for OpenAI to steal work, but if someone wants to steal theirs, it's not? I cannot believe people even try defend this shit. It's wack.
No, they're not. This is The New York Times (a corporation) vs OpenAI and Microsoft (two more corporations).
LLMs have, to my knowledge, made zero significant novel scientific discoveries. Much like crypto, they're a failure of technology to meaningfully move humanity forward; their only accomplishment is to parrot and remix information they've been trained on, which does have some interesting applications that have made Microsoft billions of dollars over the past 12 months, but let's drop the whole "they're going to save humanity and must be protected at any cost" charade. They're not AGI, and because no one has even a mote of dust of a clue as to what it will take to make AGI, its not remotely tenable to assert that they're even a stepping stone toward it.
There is significant evidence (220,000 pages worth) in their lawsuit that ChatGPT was trained on text beyond that paywall.
Saying what's legal is irrelevant is an odd take.
I like living in a place with a rule of law.
OpenAI is run by humans as well though.
So the same argument applies.
Those humans have fair use rights as well.
But I've just explained that ChatGPT can't actually produce news articles. I can't ask ChatGPT what happened today, and if I could it would be because a journalist went out and told ChatGPT what happened.
You misread the post I was responding to. They were suggesting health data with PII removed.
Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.
If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?
We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.
Disagree, it is completely relevant when discussing computers Vs people, the bar that has already been set is alternative uses.
LLMs don't have a purpose outside of regurgitating what it has ingested. CD burners at least could be claimed they were backing up your data.
Copilot provides useful autocompletes maybe… 30% of the time? But it doesn’t waste too much as it’s more of a passive tool.
But even if it did an exact match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does that).
For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.
Wouldn't it be better to: * Either not include copyrighted content in the training material in the first place * Explicitly tag the training material with license and origin infornation, such that the final output can produce a proof of what training material was relevant for producing that output and don't mix differently licensed content.
If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.
The general prescription (that I do agree not everyone accepts) society has come up with is we relegate control of some of these weapons to governments and outright ban others (like chemical weapons, biological weapons, and such) through treaties. If LLMs can cause so much damage and their use can be abused so widely, you have to stop focusing on questions about whether a user is culpable or not and move to consider whether their wide use is okay and shouldn't be controlled.
Then I am not mistaken: the company was initially selling hardware, with the software being just a value add as you say (no copyright: no interest in trying to sell, exactly my point). Then, you were being paid for building software that (a) was probably not being made public anyway, and (b) would not have been of interest to others even if it were.
Even so, if someone came to your client and offered to take on the software maintenance for a much lower price, you might have lost your client entirely. This has very much happened to contractors in the past.
And my point is you couldn't have a Microsoft or Adobe or possibly even RedHat if you didn't have copyright protecting their business. So, you'd probably not have virtually any kind of consumer software.
1) I take something away from you. You have less of it as a result. Copying is not theft.
2) I deprive you of something, such as exclusive use of your land (e.g. by trespassing) or failing to follow through on a contract. Copying is theft.
Both of those are used by different communities, who both become angry at the other.
This is a semantic argument. Most members of both groups believe that there are times when copying is wrong, and are split on when that is.
However, to group #1, "stealing" and "theft" is a highly offensive term. It's much like saying "You raped me up the ___ when you didn't pay my contractor bill on time" or other hyperboles. Not paying my bill was wrong, but it also wasn't rape. It devalues rape, insults you, and is imprecise. You should use the precise "copyright violation" which describes exactly what happened.
To group #2, NOT calling it theft is offensive, since it devalues the costs to businesses and creators of copyright violations. Whether you agree with them or not, they have certain rights under the law, and picking-and-choosing which laws to follow is wrong (especially when it's self-serving).
Because the two groups mean different things by the same words, they can never hold a rational conversation with each other, and become offended when they hear the other group speak. It's how we polarize. It's unfortunate, since there's an important discussion to be had about the limits and enforcement of copyright and patents, which really should start with the copyright clause in the constitution, and when it helps versus impedes progress and economic growth. That's a discussion possible to have analytically and rationally.
And Altman (Mr. Worldcoin) and fucking Microsoft are what, some gracious angels building chatbots for the betterment of humanity? How is them stealing as much content as they can get away with not greedy, exactly?
Labeling few samples, LoRA optimizing an LLM, generating labels on millions of samples and then training a standard classifier is an easy way to get a good classifier in matter of hours/days.
Basically any task where you can handle some inaccuracy, LLMs can be a great tool. So I don't think LLMs are a fad as such.
If that’s the case, let’s put it on the ballet and vote for it.
I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.
If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!
You still literally have not explained how this works. ChatGPT could write a news article, but it's not going to actively discover new social phenomena or interview people on the street. Niche journalism will continue having demand for the sole reason that AI can't reliably surface new and interesting content.
So... again, how does a pre-trained transformer model scoop a journalist's investigation?
> Then journalists will stop working because nobody pays them.
How is that any different than the status-quo on the internet? The cost of quality information has been declining long before AI existed. Thousands of news publications have gone out of business or been bought out since the dawn of the internet, before ChatGPT was even a household name. Since you haven't really identified what makes AI unique in this situation, it feels like you're conflating the general declining demand for journalism with AI FOMO.
Cut NYT out of the loop, fu*'em! Let them sell their own damned GPT and then charge them like crazy for the license.
The tech can either run freely in a box under my desk or I’ll have to pay upwards of 15-20k a year to run it on Adobes/Google/etcs servers. Once the tech is locked up it will skyrocket to AutoCAD type pricing because the acceleration it provides is too much.
Journos can weep, small price to pay for the tech being free for us all.
You can say that the people getting their news from the tech products will switch to paying news organizations in some way if the news starts to disappear but I highly doubt it seeing how people treat news today. And if it that does happen they’ll switch back again to the ai products as the centralization it can provide is valuable.
We didn't charge maintenance for this software. We would write it to close the sale of a computer. It was treated as "cost of sale". I'm sure it was cheaper (to us) than the various discounts and kickbacks that happened in big mainframe deals.
As far as Microsoft and Adobe is concerned, I wouldn't regard it as a misfortune if they had never existed. I'm not convinced that RedHat's existence is contingent on copyright.
If it disgorges parts of NYT articles, how do we know this is not a common phrase, or the article isn't referenced verbatim on another, unpaid site?
I agree that if it uses the whole content of their articles for training, then NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.
I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data, I think there's a bit of a "main character" vibe going on.
I think it's fine, as long as it was fed publicly accessible content, without any payment or subscription then it's accessible to an LLM as it is to you and I and that's fair.
And for the people that screech about LLMs being different because they can mass produce derivative works; first of all, ALL works are derivative and if machine produced works are compelling enough to compete with human produced ones then clearly humans need to get better at it.
The automatic loom took over from weavers cause it was better, if it wasn't then people would still work as weavers.
I agree with the key point that paid content should be licensed to be used for training, but the general argument being made has just spiralled into luddism at people who are fearful that these models could eventually take their jobs; and they will, as machines have replaced humans in so many other industries, we all reap the rewards, and industrialisation isn't to blame for the 1%, our shitty flag waving vote for your team politics are to blame.
Keep in mind these guys play both sides of every field they cover in their "news".
The model is fuzzy, it's the learning part, it'll never follow the rules to the letter the same as humans fuck up all the time.
But a model trained to be literate and parse meaning could be provided with the hard data via a vector DB or similar, it can cite sources from there or as it finds them via the internet and tbf this is how they should've trained the model.
But in order to become literate, it needs to read...and us humans reuse phrases etc we've picked up all the time "as easy as pie" oops, copyright.
Anything like word association games are basically the same exercise, but with humans and hell, I bet I could play a word association game with an LLM, too.
Having a magical ring in my book after I've read lord of the rings, is that copyright?
Maybe, but I find the "It's ok to break the law because otherwise I can't do what I want" narrative a little offputting.
It will compile an article that "looks like" NYT's (or any other news site) but none of the paragraphs were a match for any of their articles that I could find.
I'm really curious to see what evidence they have for the case beyond "it can claim to be NYT and write an article composed of all sorts of bullshit from every corner of the Web".
FWIW i don’t try to use it for this. mostly i use it to automate writing code for tasks that are well specified, often transformations from one format to another. so yes, with a solution in mind. it mostly just saves typing, which is a minority of the work, but it is a useful time saver
To me this says that openai would have access to ill-gotten raw patient data and would do the PII stripping themselves.
If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
You have not at all explained how an AI is going to somehow write a news post about something that has just happened.
Certainly not in the US. From the article you linked "In the United States, in the absence of a TDM exception, AI companies contend that inclusion of copyrighted materials in training sets constitute fair use eg not copyright infringement, which position remains to be evaluated by the courts."
Fair use is a defense against copyright infringement, but the whole question in the first place is whether generative AI training falls under fair use, and this case looks to be the biggest test of that (among others filed relatively recently).
> If that’s the case, let’s put it on the ballet and vote for it.
This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.
I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.
But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?
Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.
Someone prove to me that GPT can reproduce enough text verbatim that it makes it clear that it was trained on the original text first hand basis, rather than second hand from other sources.
In the case of slavery - we changed the law.
In the case of copyright - it's older than the Atlantic Slave Trade and still alive and kicking.
It's almost as if one of them is not like the other.
This is not a fad, this is the beginning of a world that we can just actually naturally interact to accomplish things we have to be educated on how to accomplish now.
Haha, I love that people can't see the writing on wall - I think this is a bigger invention than the smartphone that I'm typing this on now, fr - just wait and see ;)
Use this newfound insight to take my comment in good faith, as per HN guidelines, and recognize that I am making a generalized analogy about the gap between law and ethics, and not making a direct comparison between copyright and slavery.
Can we get back on topic?
Yeah, good luck embedding citations into that. Everyone here saying it's easy needs to go earn their 7 figure comp at an AI company instead of wasting their time educating us dummies.
Take a college student who scans all her textbooks, relying on fair use. If she is the only user, is she obligated to pay a premium for mining?
What about the scenario in which she sells that engine to other book owners? What if they only owned the book a short time in school?
> OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.
This is the smoking gun. GPT-4 is a large model and hence highly likely to reproduce content. They have many such examples in the court filing https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
IANAL but that's a slam dunk of copyright violation.
NYT will likely win.
Also why OpenAI should not go YOLO scaling up to GPT-5 which will likely recite more copyrighted content. More parameters, more memorization.
Ie:
- social relations -> social networks
- customer service -> chatbots and Jira
- media -> AI news, if the silly IP battles get out of the way.
- residential housing and vacations -> home swap markets
- jobs -> gig jobs, minus the benefits, plus an algorithm for a boss
I’m not sure how many other industries tech has to wade into, disrupt, creative intense negative externalities if you don’t have equity in the companies, leave, and repeat, prior to industries getting protective finally - like this lawsuit