So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.
It does make you wonder, however, whether Microsoft ought to be responsible for obeying a kind of "DMCA takedown" request adapted to ML models -- not against all 32,000 sources, but against a specified text snippet -- to be honored the next time the model is trained (or, if it's practical, via filters placed on the output of the existing model). I don't know what the law says, but a takedown mechanism certainly seems like a reasonable compromise here.
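To illustrate what the "filters on the output of the existing model" half of that might look like, here's a naive sketch: keep a blocklist of taken-down snippets and refuse completions that reproduce one of them. Everything here is hypothetical (the example snippet, the class name, the whitespace-insensitive match), and a real filter would also have to handle renamed identifiers and partial matches.

    # Hypothetical sketch of an output-side takedown filter.
    # Nothing here reflects how Copilot actually works.
    import re

    def normalize(code: str) -> str:
        """Strip all whitespace and lowercase, so trivial reformatting doesn't evade the match."""
        return re.sub(r"\s+", "", code).lower()

    class TakedownFilter:
        """Block completions that contain a taken-down snippet."""

        def __init__(self, takedown_snippets: list[str]):
            self.blocklist = [normalize(s) for s in takedown_snippets]

        def is_blocked(self, completion: str) -> bool:
            candidate = normalize(completion)
            return any(snippet in candidate for snippet in self.blocklist)

    # Usage: a completion that reproduces a taken-down snippet gets suppressed or regenerated.
    flt = TakedownFilter(["float Q_rsqrt(float number)"])  # example snippet, chosen for illustration
    print(flt.is_blocked("float Q_rsqrt( float number ) { /* ... */ }"))  # True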
No. Look at the insane YouTube copystrike situation. Why shouldn't Microsoft be held to the same standards?
Also, a major problem with YouTube is not the DMCA itself, but how YouTube implements the system, allowing abusive takedowns without repercussions for the abusers.
> So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.
Copilot is GitHub's product, and Microsoft owns GitHub. They are responsible for how that product functions. In this case, they should be held responsible for the training data they used for the ML model. Giving them the benefit of the doubt here, at minimum they chose which random third parties to believe were honest and correct. Without giving them the benefit of the doubt, they lied about what data sets they used to train it.
To put it more simply: let's say a company comes along and tells the world, "we're selling this cool book-writing robot, don't worry, it won't ever spit out anyone else's books," and then the robot regurgitates an entire chapter of Stephen King's Pet Sematary. Is that the fault of Stephen King, or of the person selling the robot?
Well, probably not; they didn't pick and choose at all. They just "chose" everyone who put code online with a license, which is a legal statement of ownership by each of those people and implies legal liability as well.
> is that the fault of Stephen King or the person selling the robot?
Well, there's certainly an argument to be made that it's neither -- it's the fault of the person who claimed Stephen King's work as their own with a legal notice that it was licensed freely to anyone. That person is the one committing theft/fraud.
The point is that with ML training data, such a vast quantity is required that it's unreasonable to expect humans to research and guarantee the legal provenance of it all. A crawler simply trusts that licenses, which are legally binding statements, were made by the actual owners rather than being fraudulent. It does seem reasonable to address the issue with takedowns, however.
What you're describing is a choice. They chose which people to believe, with zero vetting.
> The point is that with ML training data, such a vast quantity is required that it's unreasonable to expect humans to be able to research and guarantee the legal provenance of it all.
I'm not sure what you're presenting here is actually true. A key part of ML training is the training part. Other domains require a pass/fail classification of the model's output (see image identification, speech recognition, etc.) so why is source code any different? The idea that "it's too much data" is absolutely a cop-out and absurd, especially for a company sitting on ~$100B in cash reserves.
Your argument kind of demonstrates the underlying point here: They took the cheapest/easiest option and it's harmed the product.
> A crawler simply trusts that licenses, which are legally binding statements, were made by the actual owners rather than being fraudulent. It does seem reasonable to address the issue with takedowns, however.
Yes, and to reiterate, they chose this method. They were not obligated to do this, they were not forced to pick this way of doing things, and given the complete lack of transparency it's a large leap of faith to assume that their training pipeline simply looked at LICENSE files to determine which licenses were present.
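To be concrete about what "simply looked at LICENSE files" would even mean, here's a rough sketch of that approach. It's entirely hypothetical (the file names, the marker strings, the function are all assumptions), and its obvious flaw is the point: it trusts whoever committed the LICENSE file.

    # Hypothetical sketch: the naive "look at the LICENSE file" approach to
    # deciding whether a repository is fair game for training.
    from pathlib import Path

    # Substrings that loosely identify a handful of permissive licenses (assumed markers).
    PERMISSIVE_MARKERS = {
        "MIT": "permission is hereby granted, free of charge",
        "BSD": "redistribution and use in source and binary forms",
        "Apache-2.0": "apache license",
    }

    def detect_license(repo_root: str) -> str | None:
        """Return a license name if a recognizable LICENSE file is found, else None."""
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
            path = Path(repo_root) / name
            if path.is_file():
                text = path.read_text(errors="ignore").lower()
                for license_id, marker in PERMISSIVE_MARKERS.items():
                    if marker in text:
                        return license_id
        return None

    # The check says nothing about whether the committer actually owned the code.
    print(detect_license("./some-repo"))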
For what it's worth, it doesn't seem that that's what OpenAI did when they trained the model initially in their paper[1]:
> Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.
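Taken at face value, those filters amount to something like the following sketch. The thresholds are the ones quoted above; the "likely auto-generated" heuristic and the exact "small percentage of alphanumeric characters" cutoff are not specified in the paper, so they are stubbed in here as assumptions.

    # Rough sketch of the dataset filters described in the quoted passage.
    def looks_auto_generated(source: str) -> bool:
        """Placeholder: the paper doesn't say how auto-generated files were detected."""
        return "do not edit" in source.lower()  # assumption, not from the paper

    def keep_file(source: str, size_bytes: int, min_alnum_fraction: float = 0.25) -> bool:
        if size_bytes >= 1_000_000:            # only Python files under 1 MB
            return False
        lines = source.splitlines() or [""]
        lengths = [len(line) for line in lines]
        if sum(lengths) / len(lengths) > 100:  # average line length > 100
            return False
        if max(lengths) > 1000:                # maximum line length > 1000
            return False
        alnum = sum(ch.isalnum() for ch in source)
        if source and alnum / len(source) < min_alnum_fraction:  # "small percentage of alphanumeric characters" (cutoff assumed)
            return False
        if looks_auto_generated(source):
            return False
        return True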
I have not seen anything concrete about any further training after that, largely because it isn't transparent.

If we have 32,000 copies of the same code in a large database with a linking structure between the records, then we should be able to discern which are the high-provenance sources in the network, and which are the low-provenance copies. The problem is, after all, remarkably similar to building a search engine.
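To make the search-engine analogy concrete, here's a minimal sketch: treat each record holding the snippet as a node, add an edge from a copy to the record it appears to derive from (older timestamp, fork relationship, and so on), and run a PageRank-style scoring so the likely original floats to the top. The edge heuristics and the toy graph are assumptions for illustration.

    # Minimal sketch of "provenance ranking" over copies of the same snippet.
    # An edge A -> B means "A appears to be derived from B".
    def provenance_rank(edges: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
        """Simple PageRank-style power iteration; higher score = more likely the original."""
        nodes = set(edges) | {dst for dsts in edges.values() for dst in dsts}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, dsts in edges.items():
                if dsts:
                    share = damping * rank[src] / len(dsts)
                    for dst in dsts:
                        new[dst] += share
                else:
                    # Dangling node: spread its rank evenly.
                    for n in nodes:
                        new[n] += damping * rank[src] / len(nodes)
            rank = new
        return rank

    # Toy example: three copies ultimately point back at "original_repo".
    copies = {
        "copy_a": ["original_repo"],
        "copy_b": ["original_repo"],
        "copy_c": ["copy_a"],
        "original_repo": [],
    }
    print(sorted(provenance_rank(copies).items(), key=lambda kv: -kv[1]))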
Then there's also parallel discovery. People frequently come to the same solution at roughly the same time, completely independently. And this is nothing new. For instance, who discovered calculus, Newton or Leibniz? It was a roaring controversy at the time, with both claiming credit. The reality is that they both likely discovered it, completely independently, at about the same time. And there are a whole lot more people working on stuff now than there were in Newton's time!
There's also just parallel creation. Task enough people with creating an octree-based level-of-detail system in computer graphics and you're going to get a lot of relatively lengthy code that looks extremely similar, in spite of the fact that it's a generally esoteric and non-trivial problem.