I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.
It is clear OpenAI or Google did not use only Common Crawl. With so many press conferences why did no research journalist ask yet from OpenAI or Google to confirm or deny if they use or used LibGen?
Did OpenAI really bought an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any of investors due diligence, include researching the legality of the content used for training?
It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior are seeing NO federal investigations for this sort of behavior. Yet a private citizen does it and it's threats of life in prison.
This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice over investigating a large corporation engaged in the same behavior.
He also did not distribute the information wholesale. What he planned on doing with the information was never proven.
OpenAI IS distributing information they got wholesale from the internet without license to that information. Heck, they are selling the information they distribute.
That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, also knowing this was against his "right to use").
The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.
> and pull the JSTR information
No, he did not have the right to pull en masse. The JSTOR access explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he were sitting at MIT in an office not breaking into systems.
> did it in a shady way
The word you're looking for is "illegal." Breaking and entering is not simply shady - it's illegal and against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and one of the charges.
> he did it that way because he didn't want someone stealing or unplugging his laptop
Ah, the old "ends justifies break the law" argument.
Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Schwartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simply plug and forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer, he used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.
Go read the indictment and evidence.
> OpenAI IS distributing information they got wholesale
No, that ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.
It's extremely hard to even get a complete decent sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post selecting any matches).
Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
How many could I get from what Schwartz downloaded? Millions? Not just even as text - I could have gotten the complete author formatted layout, diagrams, everything, in perfect photo ready copy.
You're being dishonest in claiming these are the same. One can feel sad for Schwartz outcome, realize he was breaking the law, and realizing the current OpenAI copyright situation is likely unlike any previous copyright situation all at the same time. No need to equate such different things.
Neither MIT nor JSTOR raised issue with what Schwartz did. JSTOR even went out of their way to tell the FBI they did not want him prosecuted.
Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
> His actions started to affect all users of JSTOR at MIT. The rate of outflow caused JSTOR to suffer performance, so JSTOR disabled all of MIT access.
And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were in their rights to cut access. That does not mean that what Schwartz did was illegal. Similar to how if a business owner tells you "you need to leave now" you aren't committing a crime because they asked you to leave. That doesn't happen until you are trespassed.
> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
You violate copyright by transforming. And fortunately, it's really simple to show that chat GPT will violate and simply emit byte for byte chunks of copyrighted material.
You can, for example, ask it to implement Java's Array list and get several verbatim parts of the JDKs source code echoed back at you.
> How many could I get from what Schwartz downloaded?
0, because he didn't distribute.