I've been finding Copilot really useful, but I'll be pausing it for now, and I'm glad I've only been using it on personal projects and not anything for work. This crosses the line in my head from legal ambiguity to legal "yeah, that's gonna have to stop".
do you have the "don't reproduce code verbatim" preference set?
I suspect he has a different problem which (thanks to Microsoft) is now a problem he has to care about: his code probably shows up in one or more repos copy-pasted with improper LGPL attribution. There'd be no way for Copilot to know that had happened, and it would have mixed in the code.
(As a side note: understanding why an ML engine outputs a particular result is still an open area of research AFAIK.)
Not to mention this code wasn't public, so it's kind of moot; having someone's private code generated into my project is very bad.
As to the option, I do not; I wasn't even aware of it. But it's pretty silly to me that it's not on by default, or even really an option. It should probably be enabled with no way to toggle it off short of editing the extension.
I understand there's no way for the model to know, but then it's really on Microsoft to ensure no private, poorly licensed, or proprietary code is included in the training set. That sounds like a very tall order, but I think they're going to have to, or they'll eventually run into legal problems with someone who has enough money to make it hurt.
I grant that if most people here are using it one way, I was likely wrong about how it's typically used by the normal open source community; I followed up with a reply saying it would have been more correct for me to say "improperly licensed" code was included in the training set.
Still, it being private means it probably shouldn't be in the training set anyway, regardless of license, because in the future truly proprietary code could be included, or code without any license, which reserves all rights to the creator.
You would have to just hope that you can take down every instance of your code and keep it down, all while Copilot keeps making more instances for the next version to train on and plagiarize.
But that doesn't make it any better.
those two things exist at the same time.
try reading a licence now and again!
If your standard is "GitHub should have an oracle to the US court system and predict the outcome of a lawsuit alleging copyright infringement for a given snippet of code", then it is literally impossible for anyone to ever use any open source code, because it might contain infringing code.
There is no chain of custody for this kind of thing, which is what it would require.
Either that, or we effectively get rid of software copyright, since Copilot can be used (or even just claimed to have been used) to launder code of license restrictions. E.g., "No, I didn't copy your code; I used Copilot and it copied your code, so I did nothing wrong."
So in this case Copilot just treats the situation as "someone gifted me this", and doesn't question whether the person gifting it was the real owner of the gift.
Copilot doesn't obey the GPL license, so they would need to obtain written permission and pay license fees to be able to use the code in their product.
But it IS possible to train a model for that. In fact, I believe ML models can be fantastic "code archaeologists", giving us insights into not just direct copying, but inspiration and idioms as well. They don't just have the code, they have commit histories with timestamps.
A causal fact these models could incorporate is that we know data from the past wasn't influenced by data from the future. I believe that is a lever to pry open a lot of wondrous discoveries, and I can't wait until a model with this causal assumption is let loose on Spotify's catalog and we get a computer's perspective on who influenced whom.
But in the meantime, discovering where copy-pasted code originated should be a lot easier.
Every line is associated with a commit, which has an author date and a commit date, so a reasonable guess as to which snippet was made first can be made.
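In fact, a first pass doesn't even need a model. As a minimal sketch (the Python wrapper and function name are my own illustration, not anything Copilot does, and it assumes history hasn't been rewritten), git's "pickaxe" search can find the earliest commit that introduced a distinctive snippet in each repo containing it; comparing those dates across repos suggests which copy came first:

    import subprocess

    def earliest_commit_with(snippet: str, repo_path: str = ".") -> str | None:
        """Oldest commit whose diff changed occurrences of `snippet`.

        `git log -S` ("pickaxe") lists commits that change how often the
        string occurs; with --reverse, the first result is the earliest.
        """
        result = subprocess.run(
            ["git", "-C", repo_path, "log", "--reverse",
             "--format=%H %aI %an", f"-S{snippet}"],
            capture_output=True, text=True, check=True,
        )
        lines = result.stdout.splitlines()
        return lines[0] if lines else None  # "<hash> <author date> <author>"

    # Caveat: author dates can be forged or rewritten by rebases, so this
    # is evidence, not proof; hence the appeal of a model that weighs it.
    print(earliest_commit_with("a distinctive line of code"))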
If this is the case, I can imagine people migrating off GitHub very quickly. I can also imagine some pretty nice lawsuits opening up.
Can Copilot prove that and link to the source LGPL code whenever it reproduces more than half a line of code from such a source?
Because without that clear attribution trail, nobody in their right mind would contaminate their codebase with possibly stolen code. Hell, some bad actor might purposefully publish a proprietary base full of stolen LGPL code, and run scanners on other products until they get a Copilot "bite". When that happens and you get sued, good luck finding the original open source code both you and your aggressor derive from.
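To make the "scanner" part concrete, here's a hedged sketch of the crudest possible version: hash every k-token window of a snippet and count how many windows also occur in a target codebase. The names and the k parameter are illustrative, not any real scanner's API; real tools (MOSS-style winnowing) also normalize identifiers and whitespace so trivial renames don't evade detection:

    import hashlib

    def fingerprints(text: str, k: int = 5) -> set[str]:
        """Hash every k-token window of the text (k-gram shingling)."""
        tokens = text.split()
        return {
            hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
            for i in range(len(tokens) - k + 1)
        }

    def overlap(snippet: str, corpus: str, k: int = 5) -> float:
        """Fraction of the snippet's k-grams that also appear in the corpus."""
        snip = fingerprints(snippet, k)
        return len(snip & fingerprints(corpus, k)) / len(snip) if snip else 0.0

    # A score near 1.0 flags a likely verbatim copy worth a closer look.
    # Note it says nothing about which side was the original, which is
    # exactly the attribution problem described above.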
It would be much more valuable for people who care about the truth.
That is why Copilot should always have been opt-in (explicitly asking original authors to provide their code for Copilot training). Instead, they are simply stealing the code of others.
You might think it's unreasonable to build such a house-burning robot, but you have to realize that I actually designed it as a lawn-mowing robot. The robot simply will not value your life or property, because its utility function is descended from my own, so it may burn your house down in the regular course of its duties (if it decides to use a powerful laser to trim the grass to the exact nanometer). Sorry, neighbor.
What do you expect me to do? NOT build this robot? How dare you stand in the way of progress!
Unless you can figure out a method of determining whether someone owns the code that doesn't involve "try suing in court for copyright infringement and see if it sticks", we're kinda stuck. Just because a codebase contains an exact or similar snippet from another codebase doesn't mean that snippet reaches the threshold of copyrightable work. And the reverse: just because two code snippets look wildly different doesn't mean it's not infringement, and detecting that automatically amounts to solving the halting problem.
The thing you'd want for software to actually solve this is chain of custody, which we don't have. If you require everyone to assume everyone else could be lying or mistaken about infringement, then using any open source project for anything becomes legal hot water.
In fact, when you upload code to GitHub you grant them a license to do things like "display it", which you can't grant if you don't actually own the copyright or hold a license. So even before the code is ever slurped into Copilot, the exact same legal question arises as to whether GitHub is legally allowed to host the code at all. Can you imagine if, when you uploaded code to GitHub, you had to sign a document saying you owned the code and indemnifying Microsoft against any lawsuit alleging infringement? Oh boy, people would not enjoy that.
What a nightmare.
I'd say that constant code copying is massively pervasive, with no regard to licensing, and always has been; that's not really a bad thing, and attempts to stop it are going to be far more harmful than helpful.
From other comments, this developer's "private" code was found in 30k+ public repositories with public attribution, which is what created this issue.
Presumably your private code is not also present in, or leaked to, any public repositories.