
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative

Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.

However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.

Just like memorizing a book isn't infringement but reproducing a book from memory is.

replies(2): >>anp+K1 >>threec+v5

>>shkkmo+(OP)
This also matches my (not a lawyer) intuition, but have there been any legal precedents set in this direction yet?

>>shkkmo+(OP)
The fact that GitHub’s Copilot has an enterprise feature that matches model output against code under certain licenses (to prevent you from using it, with a notification) suggests the model outputs are at least potentially infringing.

If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (in quotes because it could be more than just a model, such as vector or search databases or some other orchestration across multiple workloads).

replies(2): >>martin+QP >>ineeda+801

>>threec+v5
> directly use public repositories

I don't see why a company which has been waging a multi-decade war against GPL and users' rights would stop at _public_ repositories.

>>threec+v5
> suggests the model outputs are at least potentially infringing.

The only thing it suggests is that they recognize that a subset of users worry about it. Whether or not GitHub worries about it any further isn’t suggested.

Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.