It is clear that OpenAI and Google did not use only Common Crawl. With so many press conferences, why has no research journalist yet asked OpenAI or Google to confirm or deny whether they use or used LibGen?
Did OpenAI really buy an ebook of every publication from Cambridge University Press, Oxford University Press, Manning, Apress, and so on? Did any investor's due diligence include researching the legality of the content used for training?
If there's a legally murky secret data sauce, it's firewalled from being easily seen in its entirety by anyone not golden-handcuffed to the company.
They may be able to train against it. They may be able to peek at portions of it. But no one is downloading all of it.
It's really disgusting, IMO, that corporations that go above and beyond that sort of behavior face NO federal investigations, yet when a private citizen does it, there are threats of life in prison.
This isn't new, but it speaks to a major hole in our legal system and the administration of it. The Feds are more than willing to steamroll an individual but will think twice before investigating a large corporation engaged in the same behavior.
Edit: the same applies to humans. Just because a healthcare company puts up an S3 bucket of patient health data behind a fully permissive robots.txt doesn't give you the right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn't grant you greater rights over it than something that isn't.
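To make "crawlable" concrete: think of a bucket sitting behind a maximally permissive robots.txt, something like this (hypothetical, of course):

    # Hypothetical robots.txt that invites every crawler in.
    # It grants permission to fetch, not a license to the data behind it.
    User-agent: *
    Allow: /

Permission to crawl and rights over the content are entirely separate questions.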
He also did not distribute the information wholesale. What he planned on doing with the information was never proven.
OpenAI IS distributing information they got wholesale from the internet without a license to that information. Heck, they are selling the information they distribute.
If OpenAI got its hands on an S3 bucket from Aetna (or any major insurer) containing full and complete health records on every American, because Aetna lacked security or leaked the bucket, should OpenAI or any other LLM provider be allowed to use that data in training, even if they strip out patient names before feeding it in?
The difference between this question and the NYT articles is that it asks about content we know should not be publicly available online (even though it is, or was at some point in the past).
I guess this really gets at: do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model's weights and numbers, etc.)?
Google may simply have been obliged to follow suit.
Personally, I’m looking forward to pirate LLMs trained on academic content.
If somehow it could be proven beyond doubt that deanonymising that data wasn't possible (which it can't be), then the harm probably wouldn't be very big, aside from general data-ownership concerns that are already being discussed.
Unequivocally, yes.
LLMs have proved themselves to be useful, at times very useful, sometimes invaluable assistants that work in different ways than we do. If sticking health data into a training set for some other AI could create another class of AI that can augment humanity, great!! Patient privacy and the law can f*k off.
I’m all for the greater good.
You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.
“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).
Copyright is granted to the creator upon creation.
That right ended when he used it to break the law. It was also for use on MIT computers, not for remote access (which is why he decided to install the laptop, knowing this too was against his "right to use").
The "right to use" also included a warning that misuse could result in state and federal prosecutions. It was not some free for all.
> and pull the JSTOR information
No, he did not have the right to pull en masse. The JSTOR access terms explicitly disallowed that. So he most certainly did not have the "right" to do that, even if he had been sitting in an MIT office, not breaking into systems.
> did it in a shady way
The word you're looking for is "illegal." Breaking and entering is not simply shady; it's against the law. B&E with intent to commit a felony (which is what he was doing) is an even more serious crime, and was one of the charges.
> he did it that way because he didn't want someone stealing or unplugging his laptop
Ah, the old "ends justify breaking the law" argument.
Now, to be precise, MIT and JSTOR went to great lengths to stop the outflow of copying, which both saw. Swartz returned multiple times to devise workarounds, continuing to break laws and circumvent yet more security measures. This was not some simple plug-and-forget laptop. He continually and persistently engaged in hacking to get around the protections both MIT and JSTOR were putting in place to stop him. He added a second computer and used MAC spoofing, among other things. His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all MIT access.
Go read the indictment and evidence.
> OpenAI IS distributing information they got wholesale
No, that's ludicrous. How many complete JSTOR papers can I pull from ChatGPT? Zero? How many complete novels? None? Short stories? Also none? Can I ask for any of a category of items and get any of them? Nope. I cannot.
It's extremely hard to get even a complete decent-sized paragraph from any work, and almost certainly not one you pre-select at will (most of those anyone produces are found by running massive search runs, then post-selecting any matches).
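(For a flavor of what those search runs involve, here's a rough sketch in Java: generate outputs, then scan each one against a reference text for long verbatim runs. Illustrative only; real extraction studies use n-gram indexes or suffix arrays at far larger scale.)

    // Rough sketch of post-hoc match hunting: flag a model output that
    // shares a long verbatim run with a known source text.
    public class OverlapScan {
        // Length of the longest substring shared by a and b (O(n*m) DP).
        static int longestCommonRun(String a, String b) {
            int[][] dp = new int[a.length() + 1][b.length() + 1];
            int best = 0;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    if (a.charAt(i - 1) == b.charAt(j - 1)) {
                        dp[i][j] = dp[i - 1][j - 1] + 1;
                        best = Math.max(best, dp[i][j]);
                    }
                }
            }
            return best;
        }

        public static void main(String[] args) {
            String modelOutput = "the quick brown fox jumps over the lazy dog";
            String sourceText = "watch the quick brown fox jumps over it";
            // A long shared run is a candidate "memorized" snippet.
            System.out.println(longestCommonRun(modelOutput, sourceText));
        }
    }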
Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
How many could I get from what Swartz downloaded? Millions? And not just as text: I could have gotten the complete author-formatted layout, diagrams, everything, in perfect photo-ready copy.
You're being dishonest in claiming these are the same. One can feel sad about Swartz's outcome, recognize he was breaking the law, and recognize that the current OpenAI copyright situation is likely unlike any previous copyright situation, all at the same time. No need to equate such different things.
Unclear what that corpus might be, or if it's the same books2 you are referring to.
I think you're missing the point, and putting the cart before the horse. If you ensure that corporations are treated as stringently as individuals sometimes are, the reverse becomes true. And that means your goal will presumably be attained, as corporate might becomes the little guy's win.
All with no unjust treatment.
Neither MIT nor JSTOR raised issue with what Swartz did. JSTOR even went out of its way to tell the FBI it did not want him prosecuted.
Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
> His actions started to affect all users of JSTOR at MIT. The rate of outflow degraded JSTOR's performance, so JSTOR disabled all MIT access.
And this is where you are confusing a "crime" with "misuse of a system". MIT and JSTOR were within their rights to cut access. That does not mean that what Swartz did was illegal. It's similar to how, if a business owner tells you "you need to leave now," you aren't committing a crime just because they asked you to leave. That doesn't happen until you are formally trespassed.
> Go ahead and demonstrate some wholesale distribution - pick an author and reproduce a few works, for example. I'll wait.
You violate copyright by transforming. And fortunately, it's really simple to show that ChatGPT will violate it and simply emit byte-for-byte chunks of copyrighted material.
You can, for example, ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you.
> How many could I get from what Swartz downloaded?
0, because he didn't distribute.
Facts are not subject to copyright. It's very obvious ChatGPT is more than a search engine regurgitating copies of pages it indexed.
That's false; but even assuming it's true, misinformation is creative content and therefore 99% of the Internet is subject to copyright.
You can read the indictment, which I already suggested you do.
> Remember, again, with what he was charged. Wiretapping and intent to distribute. He wasn't charged with trespassing, breaking and entering, or anything else. Wiretapping and intent to distribute.
He wasn't charged with wiretapping (I'm not even sure that's a generic crime). He was charged with (two counts of) wire fraud (18 USC 1343), a huge difference. He also faced 5 counts of computer fraud (18 USC 1030(a)(4), (b) & 2), 5 counts of unlawfully obtaining information from a protected computer (18 USC 1030(a)(2), (b), (c)(2)(B)(iii) & 2), and 1 count of recklessly damaging a protected computer (18 USC...).
He was not charged with "intent to distribute," and there's no such thing as a "wiretapping" charge. Did you ever once read the actual indictment, or did you just make all this up from internet forum posts?
If you're going to start with the phrase "Remember, again..", you should try not to make up nonsense. Actually read what you're asking others to "remember," which you apparently never knew in the first place.
> you are confusing a "crime" with "misuse of a system"
Apparently you are (willfully?) ignorant of law.
> You violate copyright by transforming.
That's false too. Transformative use is a defense against a claim of copyright infringement, not a way of committing it. Carefully read up on the topic.
> ask it to implement Java's ArrayList and get several verbatim parts of the JDK's source code echoed back at you
Provide the prompt. Courts have ruled that code that is the naive way to implement a simple solution is not copyrightable on its own, so if you have only a few disconnected snippets, that violates nothing. Can you make it reproduce an entire source file, comments, legalese at the top? I doubt it. To violate copyright one needs a certain amount of the content (a threshold determined at trial).
You might also want to make sure you're not simply reading OpenJDK.
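For reference, the naive shape that any "implement ArrayList" prompt converges on is roughly this (my own sketch, not copied from any JDK):

    // A naive growable array: the "simple solution" shape that, per the
    // rulings mentioned above, isn't copyrightable on its own. Any
    // competent programmer asked for an ArrayList lands on roughly this.
    import java.util.Arrays;

    public class SimpleArrayList<E> {
        private Object[] elements = new Object[10];
        private int size = 0;

        public void add(E e) {
            if (size == elements.length) {
                elements = Arrays.copyOf(elements, size * 2); // double capacity
            }
            elements[size++] = e;
        }

        @SuppressWarnings("unchecked")
        public E get(int index) {
            if (index < 0 || index >= size) {
                throw new IndexOutOfBoundsException("Index: " + index);
            }
            return (E) elements[index];
        }

        public int size() {
            return size;
        }
    }

If ChatGPT hands you something like that, you're looking at convergence on the obvious design, not proof of copying.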
> 0, because he didn't distribute.
Please read. "How many could I get from what Swartz downloaded?" does not mean he published it all before he was stopped. It means what he took.
That you seem unable to tell the difference between someone copying millions of PDFs to distribute as-is and the effort one must go to to possibly extract a desired copyrighted snippet shows either dishonesty or ignorance of the relevant laws.
See: Google turning off retention on internal conversations to avoid creating antitrust evidence.
books1 and books2 are OpenAI corpora that have never (to my knowledge) had their contents revealed.
books3 is public, was developed outside of OpenAI, and we know exactly what's in it.
If I give a kid a bunch of books, all by the same author, then pay that kid to write a book in a similar style, and then go on to sell that book... have I somehow infringed copyright?
The kid's book at best is likely to be a very convincing facsimile of the original author's work... but not the author's work.
It seems to me that the only solution for artists is to charge for access to their work in a secure environment, then lobotomise people on the way out.
The endgame seems to be "you can view and enjoy our work, but if you want to learn from or be inspired by it, that's not on."
In your example, you owned the work you gave to the person to create derivatives of.
In a more accurate example, you would be stealing those books and then giving them to someone else to create derivatives.
Artists that make easily reproducible art will see it circulate, as such art always has, along with AI output in a sea of other JPEGs.
I envision pitting corporate body against corporate body: when one corporation lobbies to (for example) extend copyright, others will work to weaken it.
That doesn't happen as vigorously currently, because there is no corporate incentive. They play the old ask-for-forgiveness-rather-than-permission angle.
Anyhow. I just prefer to set my enemies against my enemies. More fun.
How about if I got the kid to read the books on a public website where the author made the books available for free?
If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).
If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.
See below for the actual text of the law.
Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.
Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.
~~~
Here is the relevant portion of the Copyright Act:
In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—
(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or
(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.
a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he has to pay Puzo, even if the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves making a derivative copy into GPU memory. This copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
So in general, it is already as you say: corporations are much more targeted by these laws than individuals are. These laws mostly hinder corporations; we individuals are too small to be noticed by the system in most cases.
I've also seen indie games use copyrighted material with no issues, but AAA titles seem to avoid that like the plague. I can't really think of many examples where corporations break these laws more than individuals do.
Training is almost certainly fair use, so it's exactly a transitory copy in service of fair use. Training, other than the brief "transitory copy" you mention, is not copying; it's making a minuscule algorithmic adjustment based on fleeting exposure to the data.
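(As a toy illustration of what "minuscule algorithmic adjustment" means: one gradient step on a two-weight model. Nothing like a real LLM pipeline, but it shows the shape: the example nudges the weights and is then discarded.)

    // Toy sketch: one gradient-descent step on a least-squares model.
    // The training example itself is never stored; only the slightly
    // adjusted weights remain afterwards.
    public class ToyTrainingStep {
        public static void main(String[] args) {
            double[] w = {0.0, 0.0};     // model weights
            double[] x = {1.0, 2.0};     // one training example
            double target = 3.0;         // its label
            double lr = 0.01;            // learning rate

            double prediction = w[0] * x[0] + w[1] * x[1];
            double error = prediction - target;
            for (int i = 0; i < w.length; i++) {
                w[i] -= lr * error * x[i]; // minuscule per-example nudge
            }
            // x and target can now be thrown away; the "memory" of them
            // is just this tiny shift in w.
            System.out.println(w[0] + ", " + w[1]);
        }
    }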
If the NYT were relying entirely on the argument that training a model in wordcraft using their materials is always copyright violation, or only had short quotes to point to, the philosophical debate you're trying to have would be more relevant.
Why?
As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious violation, though.
There are differences, ethical, political, and otherwise, between an AI doing something and a human doing the exact same thing. Those differences may need to be reflected in new laws.
IANAL and don't have any positive suggestions for good laws; I'm just pointing out that the analogy doesn't quite hold. I think we're in new territory where analogies to previous human activities aren't always productive.
Try this with a real "kid" and you'll run into all kinds of real-world constraints, whereas flooding the world with derivative drivel using LLMs is something that's actually possible.
So yeah, stop using weak analogies, it's not helpful or intelligent.
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software—entirely within the memory system of a licensed user—in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
Do they use copyrighted material, or do they commit copyright infringement? The former doesn't necessarily constitute the latter. Likewise, given that using it is a legal option, there are other factors in the decision that likely make it less attractive for AAA games.
https://libraries.emory.edu/research/copyright/copyright-dat...
Seems very much transitory, and since the output cannot be copyrighted, it does no harm to any work it “trained” on.
I don't think you can copyright a plot or story in any country, can you?
If he had rewritten the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but legally OK.
You misread the post I was responding to. They were suggesting health data with PII removed.
Second, LLMs have proved that an AI given unlimited training data can deliver breakthroughs in AI capabilities. But LLMs are not the whole universe of AIs. Some other AI tool, distinct from LLMs, that ingests en masse as much health data as it can could provide health and human-longevity outcomes that outweigh an individual's right to privacy.
If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?
We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.
If Microsoft truly believes that the trained output doesn't violate copyright then it should be forced to prove that by training it on all its internal source code, including Windows.
If that’s the case, let’s put it on the ballot and vote for it.
I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.
If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!
If it disgorges parts of NYT articles, how do we know these aren't common phrases, or that the article isn't reproduced verbatim on another, non-paywalled site?
I agree that if they used the whole content of NYT articles for training, then the NYT should get paid, but I'm not sure that they specifically trained on "paid NYT articles" as a topic, though I'm happy to be corrected.
I also think that companies and authors extremely overvalue the tiny fragments of their work in the huge pool of training data; there's a bit of a "main character" vibe going on.
To me this says that OpenAI would have access to ill-gotten raw patient data and would do the PII stripping themselves.
> If that’s the case, let’s put it on the ballot and vote for it.
This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.