GitHub Copilot, with “public code” blocked, emits my copyrighted code

>>davidg+(OP)
As some other commenters have noted, it seems like the copyrighted code is being copied and pasted into many other codebases (shadowgovt says they found 32,000 hits), which are then (illegally) representing an incorrect license.

So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

It does make you wonder, however, if Microsoft ought to be responsible for obeying a type of "DMCA takedown" request that should apply to ML models -- not on all 32,000 sources but rather on a specified text snippet -- to be implemented the next time the model is trained (or, if it's practical, as filters placed on the output of the existing model). I don't know what the law says, but it certainly seems like a takedown model would be a good compromise here.
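To make the output-filter idea concrete, here is a rough sketch of what such a check might look like. Everything in it is an assumption for illustration: the function names, the tokenizer, and the window size `k` are made up, and nothing here reflects how Copilot or any real filter works. It flags a completion if it shares any run of `k` consecutive tokens with a snippet on a takedown list.

```python
import re

def tokens(text):
    # Split into identifiers, digit runs, and single punctuation characters,
    # so whitespace and line-break differences do not matter.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)

def shingles(text, k):
    # Every run of k consecutive tokens, as a set of tuples.
    toks = tokens(text)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def violates_takedown(output, takedown_snippets, k=8):
    # Hypothetical filter: True if the output shares at least one
    # k-token run with any snippet named in a takedown request.
    out = shingles(output, k)
    return any(out & shingles(snippet, k) for snippet in takedown_snippets)
```

A check like this only catches near-verbatim copies. Renaming a variable inside every `k`-token window would defeat it, so it is a floor, not a full answer.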

>>crazyg+7p
I don't think that Microsoft can claim to be blameless here "because it is too hard".

If we have 32 000 copies of the same code in a large database with a linking structure between the records, then we should be able to discern which are the high-provenance sources in the network and which are the low-provenance copies. The problem is, after all, remarkably similar to building a search engine.
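As a very rough sketch of the simplest version of this: cluster the copies by a whitespace-insensitive fingerprint and treat the earliest-seen copy in each cluster as the likely origin. The record layout `(repo, first_seen, code)` and the function names are hypothetical, and it assumes the timestamps can be trusted, which is a big assumption.

```python
import hashlib
from collections import defaultdict

def fingerprint(code):
    # Collapse all whitespace so reformatting alone does not change the hash.
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def likely_origins(records):
    # records: iterable of (repo, first_seen, code), where first_seen is
    # any sortable value such as an ISO date string (assumed trustworthy).
    clusters = defaultdict(list)
    for repo, first_seen, code in records:
        clusters[fingerprint(code)].append((first_seen, repo))
    # The earliest copy in each cluster is the best guess at the source.
    return {fp: min(copies)[1] for fp, copies in clusters.items()}
```

A real system would need fuzzier matching and better signals than a date, much as a search engine needs more than one ranking feature.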

>>314+fW
There is no formal linking structure in many, if not most, cases. Ctrl+V is the weapon of choice of many a programmer. To say nothing of somebody then adding superficial changes to the code to, for instance, fit their personal style or adapt it into their project. And then of course, on top of it, GitHub is not the alpha and omega of code. The original code may have been published anywhere, or even nowhere in a case such as theft.

Then there's also parallel discovery. People frequently come to the same solution at roughly the same time, completely independently. And this is nothing new. For instance, who discovered calculus? Newton or Leibniz? This was a roaring controversy at the time, with both claiming credit. The reality is that they both likely discovered it, completely independently, at about the same time. And there are a whole lot more people working on stuff now than in Newton's time!

There's also just parallel creation. Task enough people with creating an octree-based level-of-detail system in computer graphics and you're going to get a lot of relatively lengthy code that looks extremely similar, in spite of the fact that it's a generally esoteric and non-trivial problem.
