zlacker

[return to "GitHub Copilot, with “public code” blocked, emits my copyrighted code"]
1. crazyg+7p[view] [source] 2022-10-16 23:16:46
>>davidg+(OP)
As some other commenters have noted, it seems the copyrighted code has been copied and pasted into many other codebases (shadowgovt says they found 32,000 hits), which then (illegally) present it under an incorrect license.

So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

It does make you wonder, however, whether Microsoft ought to be responsible for honoring a kind of "DMCA takedown" request adapted to ML models -- applied not to all 32,000 sources but to a specified text snippet -- to be implemented the next time the model is trained (or, if it's practical, via filters placed on the output of the existing model). I don't know what the law says, but a takedown model certainly seems like a good compromise here.
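For what it's worth, the output-filter variant of that idea could be sketched very roughly like this (all names and the normalization scheme here are hypothetical illustrations, not anything Copilot actually does): keep a blocklist of snippets from takedown requests and suppress any completion that contains a normalized match.

```python
# Hypothetical sketch of a takedown-style output filter for a code model.
# The class name, API, and normalization are illustrative assumptions.
import re

class SnippetBlocklist:
    def __init__(self):
        self._blocked = set()

    @staticmethod
    def _normalize(text):
        # Collapse whitespace and lowercase so trivial reformatting
        # of a snippet does not evade the filter.
        return re.sub(r"\s+", " ", text).strip().lower()

    def add_takedown(self, snippet):
        """Register a snippet from a takedown request."""
        self._blocked.add(self._normalize(snippet))

    def is_allowed(self, completion):
        """Return False if the completion contains any blocked snippet."""
        norm = self._normalize(completion)
        return not any(b in norm for b in self._blocked)

filt = SnippetBlocklist()
filt.add_takedown("int fast_inv_sqrt(float x)")
print(filt.is_allowed("float y = x * x;"))            # True
print(filt.is_allowed("INT  fast_inv_sqrt(FLOAT x)")) # False
```

A real system would need fuzzier matching (identifier renaming, reordering), but even this naive substring check shows why filtering outputs is far cheaper than retraining on every takedown.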

2. BeefWe+zu[view] [source] 2022-10-17 00:05:02
>>crazyg+7p
Uhh, I'm gonna have to disagree hard on this take:

> So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

Copilot is GitHub's product, and Microsoft owns GitHub. They are responsible for how that product functions, and in this case they should be held responsible for the training data they used for the ML model. Giving them the benefit of the doubt, at minimum they chose which random third parties to trust as honest and correct. Without the benefit of the doubt, they lied about which data sets they used to train it.

To put it more simply: say a company comes along and tells the world, "we're selling this cool book-writing robot; don't worry, it won't ever spit out anyone else's books," and then the robot regurgitates an entire chapter of Stephen King's Pet Sematary. Is that the fault of Stephen King, or of the person selling the robot?

3. crazyg+SH[view] [source] 2022-10-17 02:18:31
>>BeefWe+zu
> Giving them the benefit of the doubt here, at minimum they chose which random third parties to believe were honest and correct.

Well, probably not: they didn't pick and choose at all, they just "chose" everyone who put code online with a license. A license is a legal statement of ownership by each of those people, and it implies legal liability as well.

> is that the fault of Stephen King or the person selling the robot?

Well, there's certainly an argument to be made that it's neither: it's the fault of the person who claimed Stephen King's work as their own, under a legal notice that it was licensed freely to anyone. That person is the one committing theft/fraud.

The point is that ML training requires such a vast quantity of data that it's unreasonable to expect humans to research and guarantee the legal provenance of all of it. A crawler simply trusts that licenses, which are legally binding statements, were made by the actual owners rather than fraudulently. It does seem reasonable to address the issue with takedowns, however.
