If someone came to you and said "good news: I memorized the code of all the open source projects in this space, and can regurgitate it on command", you would be smart to ban them from working on code at your company.
But with "AI", we make up a bunch of rationalizations. ("I'm doing AI agentic generative AI workflow boilerplate 10x gettin it done AI did I say AI yet!")
And we pretend the person never admitted that they're just loosely laundering GPL and other code in a way that would rightly be existentially toxic to an IP-based company.
Sure, it’s a big hill to climb to rethink IP law so it aligns with the societal desire that generating IP remain a viable economic work product, but that’s what’s necessary.
Yes, the training of the model itself is (or should be) a transformative act, so you can train a model on whatever you have legal access to view.
However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.
Just like memorizing a book isn't infringement, but reproducing a book from memory is.
If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (“model” in quotes because it could be more than just a model: vector or search databases, or some other orchestration across multiple workloads).
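For concreteness, here’s a minimal sketch of what that kind of orchestration could look like. Everything in it is hypothetical: the corpus, the bag-of-words retrieval, and the pass-through completion step are invented for illustration and say nothing about GitHub’s actual pipeline.

```python
import math
import re
from collections import Counter

# Hypothetical illustration only: a toy retrieval layer over chunks of
# "public repository" code. This says nothing about how Copilot actually
# works; it just shows how a search index orchestrated with a model could
# end up emitting indexed code verbatim.

PUBLIC_REPO_CHUNKS = [
    # Imagine these were scraped from GPL-licensed repositories.
    "def quicksort(xs):\n"
    "    if len(xs) <= 1:\n"
    "        return xs\n"
    "    pivot, rest = xs[0], xs[1:]\n"
    "    return (quicksort([x for x in rest if x < pivot]) + [pivot]\n"
    "            + quicksort([x for x in rest if x >= pivot]))",
    "def binary_search(xs, target):\n"
    "    lo, hi = 0, len(xs)\n"
    "    while lo < hi:\n"
    "        mid = (lo + hi) // 2\n"
    "        if xs[mid] < target:\n"
    "            lo = mid + 1\n"
    "        else:\n"
    "            hi = mid\n"
    "    return lo",
]

def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: word-token counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(prompt: str, chunks: list[str]) -> str:
    """Return the indexed chunk most similar to the prompt."""
    query = bag_of_words(prompt)
    return max(chunks, key=lambda c: cosine_similarity(query, bag_of_words(c)))

def complete(prompt: str) -> str:
    # A real system might rerank or rewrite the chunk with a model; if it
    # doesn't transform it, the "completion" is the indexed source verbatim.
    return retrieve(prompt, PUBLIC_REPO_CHUNKS)

if __name__ == "__main__":
    # Prints the quicksort chunk word for word, license and all.
    print(complete("write a quicksort function for a list"))
```

The point of the sketch is that if the retrieval step hands back an indexed chunk untransformed, the “completion” is a verbatim copy of whatever was in the index, regardless of its license.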
The only thing it suggests is that they recognize that a subset of users worry about it. Whether GitHub worries about it beyond that, it doesn’t say.
Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.