zlacker

[return to "The shady world of Brave selling copyrighted data for AI training"]
1. 6gvONx+qs[view] [source] 2023-07-15 15:13:30
>>rand0m+(OP)
> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

> 2) The nature of the copyrighted work

> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole

> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative works and transformativeness, but never about these factors. The fourth one especially seems clear in its implications for models.

Regardless, it makes the question seem much less clear-cut than people here often say.

◧◩
2. beware+Av[view] [source] 2023-07-15 15:34:11
>>6gvONx+qs
Unpopular opinion time:

An ML model is clearly a derivative work of its inputs.

Here's what I think would be fair:

Anyone who holds copyright in something used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to use accounting tricks to make profits disappear).

In the case of intermediaries (e.g., social media like reddit & twitter) those intermediaries could take a cut before passing it on to the original authors.

Obviously this would be hellishly difficult to administer, so it's unlikely to happen, but I don't see a better answer.
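As a back-of-the-envelope sketch of the scheme above, assuming per-author contribution weights could somehow be measured (the genuinely hard part) — the function name, the weights, and the 10% intermediary cut are all invented for illustration:

```python
# Hypothetical sketch of the proportional cash-flow split proposed above.
# Contribution weights and the intermediary's cut are illustrative only.

def split_cash_flow(cash_flow, contributions, intermediary_cut=0.10):
    """Split model cash flow among authors in proportion to their share
    of the training corpus; an intermediary takes a flat cut first."""
    total = sum(contributions.values())
    pool = cash_flow * (1 - intermediary_cut)
    return {author: pool * share / total
            for author, share in contributions.items()}
```

The point of dividing cash flow rather than profit is visible here: the pool is a fixed fraction of revenue, so there is nothing for accounting tricks to shrink.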

◧◩◪
3. mattbe+Iz[view] [source] 2023-07-15 15:54:19
>>beware+Av
I don't know what a fair settlement would be, but I'm looking forward to a copyright holder suing OpenAI to obtain one. These companies have no value if copyright can be enforced against their training data.
◧◩◪◨
4. visarg+oQ[view] [source] 2023-07-15 17:20:16
>>mattbe+Iz
I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, summarising it, or turning it into question-answer pairs. In this new format it could serve as training data for a clean LLM. Of course, public domain data would be used directly; no need to go synthetic there.

An important direction would be to train copyright-attribution models and diff-models that detect when one work infringes on another by direct comparison. These would be useful for filtering both the training set and the model outputs.
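A crude baseline for such a "diff-model" is plain n-gram overlap between a candidate text and known works — a minimal sketch, not the trained models the comment imagines; every name and the threshold are illustrative:

```python
# Hypothetical sketch: flag a candidate text as potentially infringing when
# its word n-gram overlap with any known work exceeds a threshold.
# The threshold and n-gram size are arbitrary illustrative choices.

def ngrams(text, n=5):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(candidate, original, n=5):
    """Jaccard similarity between the n-gram sets of two texts."""
    a, b = ngrams(candidate, n), ngrams(original, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_suspect(candidate, corpus, threshold=0.3, n=5):
    """Flag a candidate if it overlaps heavily with any work in the corpus."""
    return any(overlap_score(candidate, work, n) >= threshold
               for work in corpus)
```

The same filter could run twice: once over the training set before training, and once over model outputs before they are shown to users. A real system would need fuzzier matching than exact n-grams, since paraphrase defeats this by construction.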

◧◩◪◨⬒
5. mattbe+h61[view] [source] 2023-07-15 19:14:11
>>visarg+oQ
Would automated paraphrasing not be a derivative work of the original?
◧◩◪◨⬒⬓
6. visarg+4w2[view] [source] 2023-07-16 10:45:39
>>mattbe+h61
So you think any paraphrase of a copyrighted phrase is a copyright violation? That's like owning the idea itself. Is any utterance similar to this one now forbidden?
◧◩◪◨⬒⬓⬔
7. mattbe+XO3[view] [source] 2023-07-16 19:15:24
>>visarg+4w2
I think if you automate paraphrasing of an original work in order to use that work in bulk somehow, then yes.

How do you even automate paraphrasing without training it on lots of original work? It's infringement all the way down.
