I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist, and the generative AI revolution may never have happened, if they had tried to secure licenses before proving the technology. I do think they should course-correct quickly at this point and accept that they clearly owe something to the creators of the content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.
This also relates to earlier studies of OpenAI showing their models have a bad habit of regurgitating training data verbatim. If your training data is protected IP you didn’t secure the rights for, then that’s a real big problem. Hence this lawsuit. If it succeeds, the floodgates will open.
It is clear that neither OpenAI nor Google used only Common Crawl. With so many press conferences, why has no research journalist yet asked OpenAI or Google to confirm or deny whether they use, or ever used, LibGen?
Did OpenAI really buy an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any investor's due diligence include researching the legality of the content used for training?
It matters what is legal and what makes sense.
OSS seems to be developing its own, transparent, datasets.
closed/proprietary services that also monetize - there's a question whether it's "fair" to take and use data for free, and then basically resell access to it. the monetization aspect is the bigger rub than just data use.
(maybe it's worth noting again that "openai" is not really "open" and not the same as open source ai/ml.)
taking data, maybe data that's free to take, and then freely distributing the resulting work, that's really just fine. taking something for free without distinction (maybe it's free, maybe it's supposed to stay free, maybe it's not supposed to be used like that, maybe it's copyrighted), and then just ignoring licenses/relicensing and monetizing without care, that's just a minefield.
If there's legally-murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.
They may be able to train against it. They may be able to peek at portions of it. But no one is downloading-all.
Is using something, in its entirety, as a tiny bit of a massive data set, in order to produce something novel... infringing?
That's a pretty weird question that never existed when copyright was defined.
I personally think that giving copyright holders control over who is legally allowed to view a work that has been made publicly available is a huge step in the wrong direction. One of those reasons is open source, but really that argument applies just as well to making sure that smaller companies have a chance of competing.
I think it makes much more sense to go after the infringing uses of models rather than putting in another barrier that will further advantage the big players in this space.
Apple is already doing this: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...
Apple caught a lot of shit over the past 18 months for their lack of AI strategy; but I think two years from now they're going to look like geniuses.
In what sense are they claiming their generated contents as their own IP?
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
https://openai.com/policies/terms-of-use
> Ownership of Content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
I think it makes sense to hold model makers responsible when their tools make infringement too easy to do, or possible to do accidentally. However, that is a far cry from requiring a license to do the training in the first place.
It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations, yet a private citizen does it and faces threats of life in prison.
This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.
Edit: the same applies to humans. Just because a healthcare company puts up an S3 bucket of patient health data with a robots.txt that allows all crawlers doesn’t give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn’t provide elevated rights compared to something not crawlable.
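(For reference, the actual mechanism here is robots.txt, and a maximally permissive one, the "crawl everything" case above, is just:

    User-agent: *
    Disallow:

That tells every crawler it may fetch everything. It's a crawling convention, not a license: nothing in it grants anyone the right to republish, resell, or train on what gets fetched.)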
I don’t know the solution, but I don’t like the idea that anything I post online that is openly viewable is automatically opted into being part of ML/AI training data, and I imagine that opinion would be amplified if my writing was a product which was being directly threatened by the very same models.
More importantly, every case is unique, so what really emerged was a set of principles for what defines fair use, which will definitely guide this.
I agree. You can even listen to the NYT Hard Fork podcast (that I recommend btw https://www.nytimes.com/2023/11/03/podcasts/hard-fork-execut...) where they recently had Harvard copyright law professor Rebecca Tushnet on as a guest.
They asked her about the issue of copyrighted training data. Her response was:
""" Google, for example, with the book project, doesn’t give you the full text and is very careful about not giving you the full text. And the court said that the snippet production, which helps people figure out what the book is about but doesn’t substitute for the book, is a fair use.
So the idea of ingesting large amounts of existing works, and then doing something new with them, I think, is reasonably well established. The question is, of course, whether we think that there’s something uniquely different about LLMs that justifies treating them differently. """
Now for my take: Proving that OpenAI trained on NYT articles is not sufficient IMO. They would need to prove that OpenAI is providing a substitutable good via verbatim copying, which I don't think you can easily prove. It takes a lot of prompt engineering and luck to pull out any verbatim articles. It's well-established that LLMs screw up even well-known facts. It's quite hard to accurately pull out the training data verbatim.
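For what it's worth, the way such regurgitation usually gets demonstrated is a brute-force search with post-selection, not a single clever prompt. A minimal sketch in Python of that kind of probe, where "generate" is a stand-in for whichever completion API is being tested and "article" is text the tester already possesses:

    import difflib

    def longest_verbatim_run(completion: str, source: str) -> int:
        # Length, in characters, of the longest contiguous block shared
        # between a model completion and the source text.
        m = difflib.SequenceMatcher(None, completion, source)
        return m.find_longest_match(0, len(completion), 0, len(source)).size

    def probe(article: str, generate, prefix_len: int = 200, tries: int = 1000) -> int:
        # Prompt with the opening of the article, sample many completions,
        # and keep only the worst (longest) verbatim overlap. 'generate' is
        # assumed to map a prompt string to a completion string.
        prefix, rest = article[:prefix_len], article[prefix_len:]
        return max(longest_verbatim_run(generate(prefix), rest)
                   for _ in range(tries))

A long match on one prompt out of a thousand is real memorization, but you only surface it by searching, which is why headline examples can overstate how easily it happens in ordinary use.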
> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries
If copyright is starting to impede rather than promote progress, then it needs to change to remain constitutional.
Saying they don’t claim rights over their output while outputting large chunks verbatim is the old YouTube scheme of uploading a movie and saying “no copyright intended”.
He also did not distribute the information wholesale. What he planned on doing with the information was never proven.
OpenAI IS distributing information they got wholesale from the internet without license to that information. Heck, they are selling the information they distribute.
If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking an S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training, even if they strip out patient names before feeding it in?
The difference between this question or NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).
I guess this really gets at “do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model’s weights and numbers, etc.)?”
Of course, I’m not a lawyer and I know that in the US sticking to precedents (which mention the “verbatim” thing) takes a lot of precedence over judging something based on the spirit of the law, but stranger things have happened.
Google may simply have been obliged to follow suit.
Personally, I’m looking forward to pirate LLMs trained on academic content.
You can get basically-but-not-quite-exactly the copyrighted material that it was trained on.
Saw this a lot with some earlier image models, where you could type in an artist's name and get their work back.
The fact that AI models are having to put up guardrails to prevent that sort of use is a good sign that they weren't trained ethically and they should be paying a ton of licensing fees to the people whose content they used without permission.
As a counter argument it might be reasonable to instead say that the NYT delivers "current information" so perhaps it'd be fair to train your model on articles so long as they aren't too recent... but I think a lot of the information that the NYT now relies on for actual traffic is their non-temporal stuff - including things like life advice and recipes.
If somehow it could be proven without doubt that deanonymising that data wasn't possible (which cannot be done), then the harm probably wouldn't be very big aside from just general data ownership concerns which are already being discussed.
Unequivocally, yes.
LLMs have proved themselves to be useful, at times very useful, sometimes invaluable assistants who work in different ways than we do. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.
I’m all for the greater good.
If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
Of course, I think this is a great test case precisely because the power of "Internet scale" and generative AI is fundamentally different than our previous notions about why we wanted a "fair use exception" in the first place.
You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.
“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).
Copyright is granted to the creator upon creation.
I think publications should be protected enough to keep them in business, so I don't really know what to make of this situation.
NYT just happens to be an entity that can afford to fight Microsoft in court.
That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, also knowing this was against his "right to use").
The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.
> and pull the JSTOR information
No, he did not have the right to pull en masse. The JSTOR access explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he were sitting at MIT in an office not breaking into systems.
> did it in a shady way
The word you're looking for is "illegal." Breaking and entering is not simply shady - it's against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and one of the charges.
> he did it that way because he didn't want someone stealing or unplugging his laptop
Ah, the old "the ends justify breaking the law" argument.
Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Swartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simple plug-and-forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer and used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all of MIT's access.
Go read the indictment and evidence.
> OpenAI IS distributing information they got wholesale
No, that's ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.
It's extremely hard to even get a complete decent-sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post-selecting any matches).
Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
How many could I get from what Swartz downloaded? Millions? Not just as text - I could have gotten the complete author-formatted layout, diagrams, everything, in perfect photo-ready copy.
You're being dishonest in claiming these are the same. One can feel sad for Swartz's outcome, recognize he was breaking the law, and recognize that the current OpenAI copyright situation is likely unlike any previous copyright situation, all at the same time. No need to equate such different things.
> If your "fair use" substantially negatively affects the market for the original source material, which I think is fairly clear in this case, the courts wont look favorably on that.
I think it's fairly clear that it doesn't. No one is going to use ChatGPT to circumvent NYTimes paywalls when archive.ph and the NoPaywall browser extension exist and any copyright violations would be on the publisher of ChatGPT's content.
But let's not pretend like any of us have any clue what's going to happen in this case. Even if Judge Alsup gets it, we're so far in uncharted territory any speculation is useless.
That would be like me just photocopying a book you wrote and then handing out copies saying we’re assigning different rights to the content. The whole point of the lawsuit is that OpenAI doesn’t own the content and thus they can’t just change the ownership rights per their terms of service. It doesn’t work like that.
Hacker News has consistently upvoted posts that let users circumvent paywalls. And even when it doesn't, conversations here (and on Twitter, Reddit, etc.) that summarize the articles and quote the relevant bits as soon as the articles are published are much more of a threat to The New York Times than ChatGPT training on articles from months/years ago.
In any case, the point is that they made no claim to Output (as opposed to their code, etc) being their IP.
I definitely agree with that (at least the "far in uncharted territory" bit; as far as "speculation being useless", we're all pretty much just analyzing/guessing/shooting the shit here, so I'm not sure "usefulness" is the right barometer), which is why I'm looking forward to this case, and I also totally agree the assessment is flexible.
But I don't think your argument that it doesn't negatively affect the market holds water. Courts have held in the past that the market for impact is pretty broadly defined, e.g.
> For example, in one case an artist used a copyrighted photograph without permission as the basis for wood sculptures, copying all elements of the photo. The artist earned several hundred thousand dollars selling the sculptures. When the photographer sued, the artist claimed his sculptures were a fair use because the photographer would never have considered making sculptures. The court disagreed, stating that it did not matter whether the photographer had considered making sculptures; what mattered was that a potential market for sculptures of the photograph existed. (Rogers v. Koons, 960 F.2d 301 (2d Cir. 1992).)
From https://fairuse.stanford.edu/overview/fair-use/four-factors/
Here's a hypothetical: suppose there is a random fact about some news event that has only been reported in a single article. Do they suddenly have a monopoly on that fact, and deserve compensation whenever that fact gets picked up and repeated by other news articles or books or TV shows or movies (or AI models)?
The end game when large content producers like The New York Times are squeezed due to copyright not being enforced is that they will become more draconian in their DRM measures. If you don't like paywalls now, watch out for what happens if a free-for-all is allowed for model training on copyrighted works without monetary compensation.
I had a similar conversation with my brother-in-law, who's an economist by training but now works in data science. Initially he was on the side of OpenAI and said that model training data is fair game. After probing him, he came to the same conclusion I describe: not enforcing copyright for model training data will just result in a tightening of free access to data.
We're already seeing it from the likes of Twitter/X and Reddit. That trend is likely to spread to more content-rich companies and get even more draconian as time goes on.
Unclear what that corpus might be, or whether it's the same books2 you are referring to.
I think you're missing the point, and putting the cart before the horse. If you ensure that corporations are treated as stringently as people sometimes are, the reverse is true. And that means your goal will presumably be obtained, as the corporate might becomes the little guy's win.
All with no unjust treatment.
This isn't even "fair use". The ideas in a work are simply not protected by copyright, only the form is.
Why do you say that? Search engines would at least direct the viewer to the source. NYT gets 35%+ of its traffic from Google: https://www.similarweb.com/website/nytimes.com/#traffic-sour...
https://www.zdnet.com/article/who-owns-the-code-if-chatgpts-...
>> OpenAI (the company behind ChatGPT) does not claim ownership of generated content. According to their terms of service, "OpenAI hereby assigns to you all its right, title and interest in and to Output."
How are they giving you the rights to the work if they don't own it? They are literally asserting that they are in a position to assign the rights (to the output) to the user - that is a literal claim of ownership.
IOW, if someone says "Take this from me, I assure you it is legal to do so", they are asserting ownership of that thing.
I find irony in the newspaper suing AI when other news sources (admittedly not NYT) use AI to write the articles. How many other AI scrapers are just ingesting AI generated content?
Eventually these LLMs are going to be put in mechanical bodies with the ability to interact with the world and learn (update their weights) in realtime. Consider how absurd your perspective would be then, when it'd be illegal for this embodied LLM to read any copyrighted text, be it a book or a web page, without special permission from the copyright holder, while humans face no such restriction.
Neither MIT nor JSTOR raised issue with what Swartz did. JSTOR even went out of their way to tell the FBI they did not want him prosecuted.
Remember, again, what he was charged with: wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
> His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all of MIT's access.
And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were within their rights to cut access. That does not mean that what Swartz did was illegal. Similar to how, if a business owner tells you "you need to leave now", you aren't committing a crime just because they asked you to leave. That doesn't happen until you are trespassed.
> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
You violate copyright by transforming. And fortunately, it's really simple to show that ChatGPT will violate it, simply emitting byte-for-byte chunks of copyrighted material.
You can, for example, ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you.
> How many could I get from what Swartz downloaded?
0, because he didn't distribute.
Look at SpaceX. They paid a collective $0 to the individuals who discovered all the physics and engineering knowledge. Without that knowledge they're nothing. But still, aren't we all glad that SpaceX exists?
In exchange for all the knowledge that SpaceX is privatizing, we get to tax them. "You took from us, so we get to take it back with tax."
I think the more important consideration isn't fairness it's prosperity. I don't want to ruin the gravy train with IP and copyright law. Let them take everything, then tax the end output in order to correct the balance and make things right.
"Google Agrees to Pay Canadian Media for Using Their Content" - https://www.nytimes.com/2023/11/29/world/americas/google-can...
Sorry but if that’s the alternative to some writers feeling slighted, I’ll choose for the writers to be sad and the tech to be free.
I have no idea what on earth you are talking about. People and corporations are sued for copyright infringement all the time.
https://copyrightalliance.org/copyright-cases-2022/
Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.
Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon and can get you sued, whether it's an LLM or a sweatshop in India.
They're sued for _producing content_, not consuming content. If a human takes copyrighted output from an LLM and publishes it, they're absolutely liable if they violated copyright.
>Reading and consuming other people's content isn't illegal, and it wouldn't be for a computer either.
That is absolutely what people in this thread are suggesting should happen: that it should be illegal for OpenAI et al. to train models on publicly available content without first receiving permission from the authors.
>Reading and consuming content with the sole purpose of reproducing it verbatim is frowned upon and can get you sued, whether it's an LLM or a sweatshop in India.
That's irrelevant here because people training LLMs aren't feeding them copyrighted content for the sole purpose of reproducing it verbatim.
Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.
Newspapers are very powerful, and they own the platforms to push their opinions. I'm not about to forget the EU debates where they all (or close to all) lied about how meta tags really work to push the outcome their way; they've done it before and they will do it again.
By your logic, Firefox is re-distributing content without permission from the copyright owners whenever you use it to read a pirated book. ChatGPT isn't just randomly generating copyrighted content, it just does so when explicitly prompted by a user.
You can do exactly the same with a human author or artist if you prompt them to. And if you decide to publish this material, you're the one liable for breach of copyright, not the person you instructed to create the material.
Of course, if the input I give to ChatGPT is "here is a piece from an NYT article, please tell it to me again verbatim", followed by a copy I got from the NYT archive, and ChatGPT returns the same text I gave it as input, that is not copyright infringement. But if I say "please show me the text of the NYT article on crime from 10th January 1993", and ChatGPT returns the exact text of that article, then they are obviously infringing on NYT's distribution rights for this content, since they are retrieving it from their own storage.
If they returned a link you could click that retrieved the content from the NYT, along with anything else such as advertising, even if it were inside an iframe, it would be an entirely different matter.
I guess it's more constructive to propose alternatives than just bashing the status quo. What's your creator compensation model for a search engine? I believe whatever is being proposed trades off something significant for being more ethical.
But it falls apart because kids aren't business units trained to maximize shareholder returns (maybe in the farming age they were). OpenAI isn't open; it's making revolutionary tools that are absolutely going to be monetized by the highest bidder. A quick way to test this: NYT offers to drop their case if "open" AI "open"-ly releases all its code and training data. They're just learning, right? What's the harm?
The situations aren’t remotely similar, and that much should be obvious. In one instance ChatGPT is reproducing copyrighted work; in the other, Word is taking keyboard input from the user. Word isn’t producing anything itself.
> GPT is just a tool.
I don’t know what point this is supposed to make. It is not “just a tool” in the sense that it has no impact on what gets written.
Which brings us back to the beginning.
> the user who’s asking it to produce copyrighted content.
ChatGPT was trained on copyrighted content. The fact that it CAN reproduce the copyrighted content and the fact that it was trained on it is what the argument is about.
That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.
No, there is no clause in copyright law that says "unless someone remembered it all and copied it from their memory instead of directly from the original source." That would just be a different mechanism of copying.
Clean-room techniques are used so that, if there is incidental replication of parts of code in the course of a reimplementation of existing software, it can be proven it was not copied from the source work.
You can read the indictment, which I already suggested you do.
> Remember, again, what he was charged with: wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
He wasn't charged with wiretapping (I'm not even sure that's a generic crime). He was charged with two counts of wire fraud (18 USC 1343), a huge difference. He also faced 5 counts of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030(a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).
He was not charged with "intent to distribute", and there's no such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?
If you're going to start with the phrase "Remember, again..." you should try not to make up nonsense. Actually read what you're asking others to "remember", which you apparently never knew in the first place.
> you are confusing a "crime" with "misuse of a system"
Apparently you are (willfully?) ignorant of law.
> You violate copyright by transforming.
That's false too. Transformative use is one defense against a claim of copyright infringement. Carefully read up on the topic.
> ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you
Provide the prompt. Courts have ruled that code that is the naïve way to create a simple solution is not copyrighted on its own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount (determined by trials) of the content.
You might also want to make sure you're not simply reading OpenJDK.
> 0, because he didn't distribute.
Please read. "How many could I get from what Swartz downloaded?" does not mean he published it all before he was stopped. It means what he took.
That you seem unable to tell the difference between someone copying millions of PDFs to distribute as-is, and the effort one must go to to possibly get a desired copyrighted snippet, shows either dishonesty or ignorance of the relevant laws.
See: Google turning off retention on internal conversations to avoid creating anti-trust evidence
books1 and books2 are OpenAI corpuses that have never (to my knowledge) had their content revealed.
books3 is public, developed outside of OpenAI and we know exactly what's in it.
If I give a kid a bunch of books, all by the same author, then pay that kid to write a book in a similar style, and then go on to sell that book... have I somehow infringed copyright?
The kid's book is, at best, likely to be a very convincing facsimile of the original author's work... but not the author's work.
It seems to me that the only solution for artists is to charge for access to their work in a secure environment, then lobotomise people on the way out.
The endgame seems to be "you can view and enjoy our work, but if you want to learn from or be inspired by it, that's not on".
That isn't ironic at all, newspapers have newspaper competitors and if those competitors can steal content by washing it through an AI that is a serious problem. If these AI models weren't used to produce news articles and similar then it would be a much smaller issue.
In your example you owned the work you gave to the person to create derivatives of.
In a more accurate example you would be stealing those books and then giving them to someone else to create derivatives.
Easily reproducible art will circulate as it always has, alongside AI output, in a sea of other JPGs.
I envision pitting corporate body against corporate body: when one corporate body lobbies to (for example) extend copyrights, others will work to weaken copyright.
That doesn't happen as vigorously currently, because there is no corporate incentive. They play the old ask-for-forgiveness-rather-than-permission angle.
Anyhow. I just prefer to set my enemies against my enemies. More fun.
How about if I got the kid to read the books on a public website where the author made the books available for free?
If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).
If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.
See below for the actual text of the law.
Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.
Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.
~~~
Here is the relevant portion of the Copyright Act:
In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—
(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or
(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.
a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
So in general it is already as you say: corporations are much more targeted by these laws than individuals are. These laws mostly hinder corporations; individuals are too small to be noticed by the system in most cases.
I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations are breaking these laws more than small individuals do.
The main objectors are the old guard monopolies that are threatened.
Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention is not copying, it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.
Do you illegally share them via torrents, or even sell copies of these works?
Because that is what’s going on here.
If NYT were fully relying on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.
Why?
As far as I understand, the copyright owner has control over all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious violation, though.
There are ethical, political, and other differences between an AI doing something and a human doing the exact same thing. Those differences may need reflecting in new laws.
IANAL and don’t have any positive suggestions for good laws; I'm just pointing out that the analogy doesn’t quite hold. I think we’re in new territory where analogies to previous human activities aren’t always productive.
Try this with a real "kid" and you'll run into all kinds of real-world constraints, whereas flooding the world with derivative drivel using LLMs is something that's actually possible.
So yeah, stop using weak analogies, it's not helpful or intelligent.
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
They use copyrighted material or they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given it's an option legally, there are other factors that go into the decision to use it that likely make it less attractive to AAA games.
https://libraries.emory.edu/research/copyright/copyright-dat...
Seems clearly transitory, and since the output cannot be copyrighted, it does no harm to any work it “trained” on.
I don't think you can copyright a plot or story in any country, can you?
If he had re-written the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but legally OK.
Saying what's legal is irrelevant is an odd take.
I like living in a place with a rule of law.
You misread the post I was responding to. They were suggesting health data with PII removed.
Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.
If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?
We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.
Disagree, it is completely relevant when discussing computers vs. people; the bar that has already been set is alternative uses.
LLMs don't have a purpose outside of regurgitating what they have ingested. CD burners, at least, could be claimed to be backing up your data.
If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.
If that’s the case, let’s put it on the ballot and vote for it.
I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.
If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!
The tech can either run freely in a box under my desk or I’ll have to pay upwards of 15-20k a year to run it on Adobes/Google/etcs servers. Once the tech is locked up it will skyrocket to AutoCAD type pricing because the acceleration it provides is too much.
Journos can weep, small price to pay for the tech being free for us all.
If it disgorges parts of NYT articles, how do we know this is not a common phrase, or that the article isn't referenced verbatim on another, unpaid site?
I agree that if it uses the whole content of their articles for training, then NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.
I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data, I think there's a bit of a "main character" vibe going on.
Maybe, but I find the "It's ok to break the law because otherwise I can't do what I want" narrative a little offputting.
To me this says that openai would have access to ill-gotten raw patient data and would do the PII stripping themselves.
Certainly not in the US. From the article you linked "In the United States, in the absence of a TDM exception, AI companies contend that inclusion of copyrighted materials in training sets constitute fair use eg not copyright infringement, which position remains to be evaluated by the courts."
Fair use is a defense against copyright infringement, but the whole question in the first place is whether generative AI training falls under fair use, and this case looks to be the biggest test of that (among others filed relatively recently).
> If that’s the case, let’s put it on the ballot and vote for it.
This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.
In the case of slavery - we changed the law.
In the case of copyright - it's older than the Atlantic Slave Trade and still alive and kicking.
It's almost as if one of them is not like the other.
Use this newfound insight to take my comment in good faith, as per HN guidelines, and recognize that I am making a generalized analogy about the gap between law and ethics, and not making a direct comparison between copyright and slavery.
Can we get back on topic?