It is clear that OpenAI and Google did not use only Common Crawl. With so many press conferences, why has no research journalist yet asked OpenAI or Google to confirm or deny whether they use or used LibGen?
Did OpenAI really buy an ebook of every publication from Cambridge University Press, Oxford University Press, Manning, Apress, and so on? Did any investor's due diligence include researching the legality of the content used for training?
If there's a legally murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.
They may be able to train against it. They may be able to peek at portions of it. But no one is downloading all of it.
It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior face NO federal investigations, yet when a private citizen does it, there are threats of life in prison.
This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice before investigating a large corporation engaged in the same behavior.
Edit: the same applies to humans. Just because a healthcare company puts up an S3 bucket of patient health data behind a fully permissive robots.txt doesn't give you the right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn't grant you greater rights over it than something that isn't.
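To make "crawlable" concrete: think of a bucket sitting behind a maximally permissive robots.txt, something like this (hypothetical, of course):

    # Hypothetical robots.txt that invites every crawler in.
    # It grants permission to fetch, not a license to the data behind it.
    User-agent: *
    Allow: /

Permission to crawl and rights over the content are entirely separate questions.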
He also did not distribute the information wholesale. What he planned on doing with the information was never proven.
OpenAI IS distributing information they got wholesale from the internet without a license to that information. Heck, they are selling the information they distribute.
If OpenAI got its hands on an S3 bucket from Aetna (or any major insurer) containing full and complete health records on every American, because Aetna lacked security or leaked the bucket, should OpenAI or any other LLM provider be allowed to use that data in training, even if they strip out patient names before feeding it in?
The difference between this question and the NYT articles is that it asks about content we know should not be publicly available online (even though it is, or was at some point in the past).
I guess this really gets at: do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model's weights and numbers, etc.)?
Google may simply have been obliged to follow suit.
Personally, I’m looking forward to pirate LLMs trained on academic content.
If somehow it could be proven beyond doubt that deanonymising that data wasn't possible (which it can't be), then the harm probably wouldn't be very big, aside from general data-ownership concerns that are already being discussed.
Unequivocally, yes.
LLMs have proved themselves to be useful, at times very useful, sometimes invaluable assistants that work in different ways than we do. If sticking health data into a training set for some other AI could create another class of AI that can augment humanity, great!! Patient privacy and the law can f*k off.
I’m all for the greater good.
You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.
“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).
Copyright is granted to the creator upon creation.
That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, knowing this too was against his "right to use").
The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.
> and pull the JSTOR information
No, he did not have the right to pull en masse. The JSTOR access terms explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he had been sitting in an MIT office, not breaking into systems.
> did it in a shady way
The word you're looking for is "illegal." Breaking and entering is not simply shady; it's against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and was one of the charges.
> he did it that way because he didn't want someone stealing or unplugging his laptop
Ah, the old "ends justify breaking the law" argument.
Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Swartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simple plug-and-forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer and used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all MIT access.
Go read the indictment and evidence.
> OpenAI IS distributing information they got wholesale
No, that's ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.
It's extremely hard to get even a complete decent-sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post-selecting any matches).
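(For a flavor of what those search runs involve, here's a rough sketch in Java: generate outputs, then scan each one against a reference text for long verbatim runs. Illustrative only; real extraction studies use n-gram indexes or suffix arrays at far larger scale.)

    // Rough sketch of post-hoc match hunting: flag a model output that
    // shares a long verbatim run with a known source text.
    public class OverlapScan {
        // Length of the longest substring shared by a and b (O(n*m) DP).
        static int longestCommonRun(String a, String b) {
            int[][] dp = new int[a.length() + 1][b.length() + 1];
            int best = 0;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    if (a.charAt(i - 1) == b.charAt(j - 1)) {
                        dp[i][j] = dp[i - 1][j - 1] + 1;
                        best = Math.max(best, dp[i][j]);
                    }
                }
            }
            return best;
        }

        public static void main(String[] args) {
            String modelOutput = "the quick brown fox jumps over the lazy dog";
            String sourceText = "watch the quick brown fox jumps over it";
            // A long shared run is a candidate "memorized" snippet.
            System.out.println(longestCommonRun(modelOutput, sourceText));
        }
    }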
Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
How many could I get from what Swartz downloaded? Millions? And not just as text: I could have gotten the complete author-formatted layout, diagrams, everything, in perfect photo-ready copy.
You're being dishonest in claiming these are the same. One can feel sad about Swartz's outcome, recognize he was breaking the law, and recognize that the current OpenAI copyright situation is likely unlike any previous copyright situation, all at the same time. No need to equate such different things.
Unclear what that corpus might be, or if it's the same books2 you are referring to.
I think you're missing the point, and putting the cart before the horse. If you ensure that corporations are treated as stringently as individuals sometimes are, the reverse becomes true. And that means your goal will presumably be attained, as corporate might becomes the little guy's win.
All with no unjust treatment.
Neither MIT nor JSTOR raised issue with what Swartz did. JSTOR even went out of its way to tell the FBI it did not want him prosecuted.
Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
> His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all MIT access.
And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were within their rights to cut access. That does not mean that what Swartz did was illegal. It's similar to how, if a business owner tells you "you need to leave now," you aren't committing a crime just because they asked you to leave. That doesn't happen until you are formally trespassed.
> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
You violate copyright by transforming. And fortunately, it's really simple to show that ChatGPT will violate it and simply emit byte-for-byte chunks of copyrighted material.
You can, for example, ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you.
> How many could I get from what Swartz downloaded?
0, because he didn't distribute.
Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.
That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.
You can read the indictment, which I already suggested you do.
> Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
He wasn't charged with wiretapping (I'm not even sure that's a generic crime). He was charged with (two counts of) wire fraud (18 USC 1343), a huge difference. He also faced 5 counts of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030(a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).
He was not charged with "intent to distribute," and there's no such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?
If you're going to start with the phrase "Remember, again..", you should try not to make up nonsense. Actually read what you're asking others to "remember," which you apparently never knew in the first place.
> you are confusing a "crime" with "misuse of a system"
Apparently you are (willfully?) ignorant of law.
> You violate copyright by transforming.
That's false too. Transformative use is a defense against a claim of copyright infringement, not a way of committing it. Carefully read up on the topic.
> ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you
Provide the prompt. Courts have ruled that code that is the naive way to implement a simple solution is not copyrightable on its own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount of the content (a threshold determined at trial).
You might also want to make sure you're not simply reading OpenJDK.
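For reference, the naive shape that any "implement ArrayList" prompt converges on is roughly this (my own sketch, not copied from any JDK):

    // A naive growable array: the "simple solution" shape that, per the
    // rulings mentioned above, isn't copyrightable on its own. Any
    // competent programmer asked for an ArrayList lands on roughly this.
    import java.util.Arrays;

    public class SimpleArrayList<E> {
        private Object[] elements = new Object[10];
        private int size = 0;

        public void add(E e) {
            if (size == elements.length) {
                elements = Arrays.copyOf(elements, size * 2); // double capacity
            }
            elements[size++] = e;
        }

        @SuppressWarnings("unchecked")
        public E get(int index) {
            if (index < 0 || index >= size) {
                throw new IndexOutOfBoundsException("Index: " + index);
            }
            return (E) elements[index];
        }

        public int size() {
            return size;
        }
    }

If ChatGPT hands you something like that, you're looking at convergence on the obvious design, not proof of copying.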
> 0, because he didn't distribute.
Please read. "How many could I get from what Swartz downloaded?" does not mean he published it all before he was stopped. It means what he took.
That you seem unable to tell the difference between someone copying millions of PDFs to distribute as-is and the effort one must go to to possibly extract a desired copyrighted snippet shows either dishonesty or ignorance of the relevant laws.
See: Google turning off retention on internal conversations to avoid creating antitrust evidence.
books1 and books2 are OpenAI corpora that have never (to my knowledge) had their contents revealed.
books3 is public, was developed outside of OpenAI, and we know exactly what's in it.
If I give a kid a bunch of books, all by the same author, then pay that kid to write a book in a similar style, and then go on to sell that book... have I somehow infringed copyright?
The kid's book at best is likely to be a very convincing facsimile of the original author's work... but not the author's work.
It seems to me that the only solution for artists is to charge for access to their work in a secure environment, then lobotomise people on the way out.
The endgame seems to be "you can view and enjoy our work, but if you want to learn from or be inspired by it, that's not on."
In your example, you owned the work you gave to the person to create derivatives of.
In a more accurate example, you would be stealing those books and then giving them to someone else to create derivatives.
Artists that make easily reproducible art will see it circulate, as such art always has, along with AI output in a sea of other JPEGs.
I envision pitting corporate body against corporate body: when one corporation lobbies to (for example) extend copyright, others will work to weaken it.
That doesn't happen as vigorously currently, because there is no corporate incentive. They play the old ask-for-forgiveness-rather-than-permission angle.
Anyhow. I just prefer to set my enemies against my enemies. More fun.
How about if I got the kid to read the books on a public website where the author made the books available for free?
If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).
If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.
See below for the actual text of the law.
Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.
Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.
~~~
Here is the relevant portion of the Copyright Act:
In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—
(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or
(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.
a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
So in general, it is already as you say: corporations are much more targeted by these laws than individuals are. These laws mostly hinder corporations; we individuals are too small to be noticed by the system in most cases.
I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations break these laws more than individuals do.
Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention, is not copying; it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.
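(As a toy illustration of what "minuscule algorithmic adjustment" means: one gradient step on a two-weight model. Nothing like a real LLM pipeline, but it shows the shape: the example nudges the weights and is then discarded.)

    // Toy sketch: one gradient-descent step on a least-squares model.
    // The training example itself is never stored; only the slightly
    // adjusted weights remain afterwards.
    public class ToyTrainingStep {
        public static void main(String[] args) {
            double[] w = {0.0, 0.0};     // model weights
            double[] x = {1.0, 2.0};     // one training example
            double target = 3.0;         // its label
            double lr = 0.01;            // learning rate

            double prediction = w[0] * x[0] + w[1] * x[1];
            double error = prediction - target;
            for (int i = 0; i < w.length; i++) {
                w[i] -= lr * error * x[i]; // minuscule per-example nudge
            }
            // x and target can now be thrown away; the "memory" of them
            // is just this tiny shift in w.
            System.out.println(w[0] + ", " + w[1]);
        }
    }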
If the NYT were relying entirely on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.
Why?
As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious violation, though.
There are differences, ethical, political, and otherwise, between an AI doing something and a human doing the exact same thing. Those differences may need to be reflected in new laws.
IANAL and don't have any positive suggestions for good laws; I'm just pointing out that the analogy doesn't quite hold. I think we're in new territory where analogies to previous human activities aren't always productive.
Try this with a real "kid" and you'll run into all kinds of real-world constraints, whereas flooding the world with derivative drivel using LLMs is something that's actually possible.
So yeah, stop using weak analogies, it's not helpful or intelligent.
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
Do they use copyrighted material, or do they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given that using it is a legal option, there are other factors in the decision that likely make it less attractive for AAA games.
https://libraries.emory.edu/research/copyright/copyright-dat...
Seems very much transitory, and since the output cannot be copyrighted, it does no harm to any work it “trained” on.
I don't think you can copyright a plot or story in any country, can you?
If he had rewritten the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but legally OK.
You misread the post I was responding to. They were suggesting health data with PII removed.
Second, LLMs have proved that an AI given unlimited training data can deliver breakthroughs in AI capabilities. But LLMs are not the whole universe of AIs. Some other AI tool, distinct from LLMs, that ingests en masse as much health data as it can could provide health and human-longevity outcomes that outweigh an individual's right to privacy.
If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?
We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.
If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.
If that’s the case, let’s put it on the ballot and vote for it.
I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.
If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!
If it disgorges parts of NYT articles, how do we know these aren't common phrases, or that the article isn't reproduced verbatim on another, non-paywalled site?
I agree that if they used the whole content of NYT articles for training, then the NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.
I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data; there's a bit of a "main character" vibe going on.
To me this says that OpenAI would have access to ill-gotten raw patient data and would do the PII stripping themselves.
> If that’s the case, let’s put it on the ballot and vote for it.
This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.