zlacker

[parent] [thread] 4 comments
1. pedroc+(OP)[view] [source] 2023-07-15 16:39:11
It's not clear that "data mining" covers this use. These models are huge, big enough that they can simply contain direct copies of copyrighted works, and they've been shown to reproduce those works fairly easily. The counterargument is that they've generalized or learned enough that they're no longer just the sum of the dataset. I can see that being possible, but given how the technology works it's very hard to tell whether that has actually happened or whether what's happening instead is a bunch of copyright washing.

There are some things that would make for good faith displays by the players in the space. For example, Microsoft has been investing a lot and yet their code offering is not trained on their internal code base. Same for Google. Start by doing that and I'll entertain the argument that your tools are fair use or data mining.

replies(1): >>nieman+34
2. nieman+34[view] [source] 2023-07-15 17:00:52
>>pedroc+(OP)
My reading of the relevant laws leads me to believe that this is not a problem, as long as those reproductions are not returned to users and the rights holder did not opt out. But courts might decide differently.

Regarding the copyright of returned material here is a good discussion:

https://copyrightblog.kluweriplaw.com/2023/05/09/generative-...

replies(2): >>JumpCr+v9 >>pedroc+jr
3. JumpCr+v9[view] [source] [discussion] 2023-07-15 17:31:25
>>nieman+34
> as long as those reproductions are not returned

That’s the author’s entire gripe. Brave reproduced a Wikipedia entry without attribution and then slapped a copyright on it to boot.

4. pedroc+jr[view] [source] [discussion] 2023-07-15 19:40:32
>>nieman+34
That's clearly not enough. There's a continuum between reproducing the input exactly and showing genuine creativity because the model actually learned something. A model that merely reformats code and renames all the variables would pass your test and yet clearly be a copyright violation. The whole argument requires that the neural network weights do something creative because they learned from the code, rather than just transforming it. We're careful about this even with humans, using things like clean-room reimplementation to make sure.
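To make the point concrete, here's a minimal sketch (my own illustration, not anything any model actually does) of how mechanically a "rename all the variables" transformation can be done while leaving behavior untouched. The class and variable names are hypothetical:

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Replace every name in the source with an opaque placeholder.

    The result is functionally identical to the input -- exactly the
    kind of surface-level change that shouldn't count as a new work.
    """
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        # Assign placeholders (var_0, var_1, ...) in order of first use.
        if node.id not in self.mapping:
            self.mapping[node.id] = f"var_{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

source = "total = 0\nfor item in [1, 2, 3]:\n    total = total + item\n"
tree = RenameVariables().visit(ast.parse(source))
print(ast.unparse(tree))  # same loop, all names replaced
```

A diff tool comparing the two versions token-by-token would flag them as different, even though no creative choice was made anywhere.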
replies(1): >>hakfoo+w31
5. hakfoo+w31[view] [source] [discussion] 2023-07-16 01:27:36
>>pedroc+jr
The window of possible actual creativity may be limited and variable.

There are a lot of pretty complex prompts where, if you asked a group of reasonably skilled programmers to implement them, they'd produce code that was "reformatted and changed variable names" but otherwise identical. Many of us learned from the same foundational materials, and there are only a handful of non-pathological ways to implement a linked list of integers, for example.
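For instance, a linked list of integers written independently by most programmers taught from the same textbooks would converge on essentially this shape, differing mainly in names (this is my own sketch, with hypothetical class names):

```python
# A textbook singly linked list of integers. Independent authors would
# differ in naming and little else.
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self):
        self.head = None

    def push(self, value):
        # Prepend at the head: the canonical O(1) insertion.
        self.head = Node(value, self.head)

    def to_list(self):
        out, node = [], self.head
        while node:
            out.append(node.value)
            node = node.next
        return out

lst = LinkedList()
for n in (1, 2, 3):
    lst.push(n)
print(lst.to_list())  # -> [3, 2, 1]
```

Two such implementations would look "copied" to a plagiarism detector despite being written in genuine clean-room conditions, which is what makes the creativity question so hard to settle from the output alone.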

With code it may be more obvious, in that you can't as easily obfuscate things with synonyms and sentence structure changes. Even with prose, there is going to be a tendency to use "conventional" language choices, driving you back towards a familiar-looking mean.
