zlacker

Why isn't robots.txt enough to enforce copyright etc? If NYT didn't set robots.txt properly, is their content free-for-all? Yes I know the first answer you would jump to is "of course not, copyright is the default", but it's almost 2024 and we have had robots.txt as industry de jure to stop crawling.

replies(4): >>cj+o2 >>adrr+l8 >>trogdo+l9 >>fennec+Oe3

>>alfied+(OP)
robots.txt is not meant to be a mechanism of communicating the licensing of content on the page being crawled nor is it meant to communicate how the crawled content is allowed to be used by the crawler.

Edit: same applies to humans. Just because a healthcare company puts up a S3 bucket with patient health data with “robots: *” doesn’t give you a right to view or use the crawled patient data. In fact, redistributing it may land you in significant legal trouble. Something being crawlable doesn’t provide elevated rights compared to something not crawlable.

replies(1): >>cj+36

>>cj+o2
Furthering the S3 health data thought exercise:

If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

The difference between this question or NYT articles is that this question asks about content we know should not be available publicly online (even though it is or was at some point in the past).

I guess this really gets at “do we care about how the training data was obtained or pre-processed, or do we only care about the output (a model’s weights and numbers, etc)

replies(2): >>stuart+D7 >>Sai_+V7

>>cj+36
HIPAA is about more than just names. Just information such as a patient's ZIP code and full medical history is often enough to de-anonymise someone. HIPAA breaches are considered much more severe than intellectual property infringements. I think the main reason that patients are considered to have ownership of even anonymised versions of their data (in terms of controlling how it is used) is that attempted anonymisation can fail, and there is always a risk of being deanonymised.

If somehow it could be proven without doubt that deanonymising that data wasn't possible (which cannot be done), then the harm probably wouldn't be very big aside from just general data ownership concerns which are already being discussed.

>>cj+36
> should [they] be allowed to use this data in training…?

Unequivocally, yes.

LLMs have proved themselves to be useful, at times, very useful, sometimes invaluable assistants who work in different ways than us. If sticking health data into a training set for some other AI could create another class of AI which can augment humanity, great!! Patient privacy and the law can f*k off.

I’m all for the greater good.

replies(1): >>davkan+xH

>>alfied+(OP)
Robot.txt isn't about copyrights, its about preventing bots. Its effectively a EULA. Copyright law only goes into effect when you distribute the content you scrape. If you scraped New York times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.

replies(2): >>sam_lo+Hd >>oldgra+S51

>>alfied+(OP)
>Why isn't robots.txt enough to enforce copyright

You actually need a lot more than that. Most significantly, you need to have registered the work with the Copyright Office.

“No civil action for infringement of the copyright in any United States work shall be instituted until ... registration of the copyright claim has been made in accordance with this title.” 17 USC §411(a).

replies(1): >>jacobl+4a

>>trogdo+l9
But the thing is, you can only bring the civil action forward after registering your claim but you need not register the claim before the infringement occurs.

Copyright is granted to the creator upon creation.

replies(1): >>trogdo+6Q

>>adrr+l8
Er... This is what all these lawsuits against LLMs are hoping to disprove

replies(1): >>jeremy+jz

>>sam_lo+Hd
Which lawsuits are concerning LLMs used only privately by the organization that developed it?

>>Sai_+V7
Eliminating the right to patient privacy does not serve the greater good. People have enough distrust of the medical system already. I’m ambivalent to training on properly anonymized health data but, i reject out of hand the idea that OpenAI et al should have unfettered access to identifiable private conversations between me and my doctor for the nebulous goal of some future improvement on llm models.

replies(1): >>Sai_+lS1

>>jacobl+4a
That is incorrect.

If the work is unpublished for the purposes of the Copyright Act, you do have to register (or preregister) the work prior to the infringement. 17 USC § 412(1).

If the work is published, you still have to register it within the earlier of (a) three months after the first publication of the work or (b) one month after the copyright owner learns of the infringement.

See below for the actual text of the law.

Publication, for the purposes of the Copyright Act, generally means transferring or offering a copy of the work for sale or rental. But there are many cases where it’s not clear whether a work has or has not been published — most notably when a work is posted online and can be downloaded, but has not been explicitly offered for sale.

Also, the Supreme Court recently ruled that the mere filing of an application for registration is insufficient to file suit. The Register of Copyrights has to actually grant your application. The registration process typically takes many months, though you can pay $800 for expedited processing, if you need it.

~~~

Here is the relevant portion of the Copyright Act:

In any action under this title, other than an action brought for a violation of the rights of the author under section 106A(a), an action for infringement of the copyright of a work that has been preregistered under section 408(f) before the commencement of the infringement and that has an effective date of registration not later than the earlier of 3 months after the first publication of the work or 1 month after the copyright owner has learned of the infringement, or an action instituted under section 411(c), no award of statutory damages or of attorney’s fees, as provided by sections 504 and 505, shall be made for—

(1) any infringement of copyright in an unpublished work commenced before the effective date of its registration; or

(2) any infringement of copyright commenced after first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work.

>>adrr+l8
> If you scraped New York times for your own LLM that you used internally and didn't distribute the results, there would be no copyright infringement.

Why?

As far as I understand, the copyright owner has control of all copying, regardless of whether it is done internally or externally. Distributing it externally would be a more serious vilation, though.

>>davkan+xH
> unfettered access to identifiable private conversations

You misread the post I was responding to. They were suggesting health data with PII removed.

Second, LLMs have proved that AI which gets unlimited training data can provide breakthroughs in AI capabilities. But they are not the whole universe of AIs. Some other AI tool, distinct from LLMs, which ingests en masse as much health data as it can could provide health and human longevity outcomes which could outweigh an individual's right to privacy.

If transformers can benefit from scale, why not some other, existing or yet to be found, AI technology?

We should be supporting a Common Crawl for health records, digitizing old health records, and shaming/forcing hospitals, research labs, and clinics into submitting all their data for a future AI to wade into and understand.

replies(2): >>cj+xA2 >>davkan+9C3

>>Sai_+lS1
> could outweigh an individual's right to privacy.

If that’s the case, let’s put it on the ballet and vote for it.

I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything.

If there truly is some breakthrough and all we need is everyone’s data, tell the population and sell it to the people and let’s vote on it!

replies(1): >>Sai_+Nh4

>>alfied+(OP)
NYT seemed to claim paid subscriptions as well, which I'm not sure that bots can actually crawl.

>>Sai_+lS1
> Furthering the S3 health data thought exercise: If OpenAI got their hands on an S3 bucket from Aetna (or any major insurer) with full and complete health records on every American, due to Aetna lacking security or leaking a S3 bucket, should OpenAI or any other LLM provider be allowed to use the data in its training even if they strip out patient names before feeding it into training?

To me this says that openai would have access to ill-gotten raw patient data and would do the PII stripping themselves.

>>cj+xA2
> I’m tired of big tech making policy decisions by “asking for permission later” and getting away with everything

> If that’s the case, let’s put it on the ballet and vote for it.

This vote will mean "faster horses" for everyone. Exponential progress by committee is almost unheard of.