The New York Times is suing OpenAI and Microsoft for copyright infringement

>>ssgodd+(OP)
Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines.)

I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist and the generative AI revolution may never have happened if they put the horse before the cart. I do think they should quickly course correct at this point and accept the fact that they clearly owe something to the creators of content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.

>>kbos87+Na
For all the leaks on: Secret projects, novelty training algorithms not being published anymore so as to preserve market share, custom hardware, Q* learning, internal politics at companies at the forefront of state of the art LLMs...A thunderous silence is the lack of leaks, on the exact datasets used to train the main commercial LLMs.

It is clear OpenAI or Google did not use only Common Crawl. With so many press conferences why did no research journalist ask yet from OpenAI or Google to confirm or deny if they use or used LibGen?

Did OpenAI really bought an ebook of every publication from Cambridge Press, Oxford Press, Manning, APress, and so on? Did any of investors due diligence, include researching the legality of the content used for training?

>>belter+kl
Why isn't robots.txt enough to enforce copyright etc? If NYT didn't set robots.txt properly, is their content free-for-all? Yes I know the first answer you would jump to is "of course not, copyright is the default", but it's almost 2024 and we have had robots.txt as industry de jure to stop crawling.

>>alfied+cs
robots.txt is not meant to be a mechanism of communicating the licensing of content on the page being crawled nor is it meant to communicate how the crawled content is allowed to be used by the crawler.

Edit: same applies to humans. Just because a healthcare company puts up a S3 bucket with patient health data with “robots: *” doesn’t give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn’t provide elevated rights compared to something not crawlable.

>>cj+Au
Furthering the S3 health data thought exercise:

If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

The difference between this question or NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).

I guess this really gets at “do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model’s weights and numbers, etc)

>>cj+fy
> should [they] be allowed to use this data in training…?

Unequivocally, yes.

LLMs have proved themselves to be useful, at times, very useful, sometimes invaluable assistants who work in different ways than us. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.

I’m all for the greater good.

>>Sai_+7A
Eliminating the right to patient privacy does not serve the greater good. People have enough distrust of the medical system already. I’m ambivalent to training on properly anonymized health data but, i reject out of hand the idea that OpenAI et al should have unfettered access to identifiable private conversations between me and my doctor for the nebulous goal of some future improvement on llm models.

>>davkan+J91
> unfettered access to identifiable private conversations

You misread the post I was responding to. They were suggesting health data with PII removed.

Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.

If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?

We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.

>>Sai_+xk2
> could outweigh an individual's right to privacy.

If that’s the case, let’s put it on the ballet and vote for it.

I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.

If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!

>>cj+J23
> I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything

> If that’s the case, let’s put it on the ballet and vote for it.

This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.

zlacker