GitHub Copilot, with “public code” blocked, emits my copyrighted code

>>davidg+(OP)
I just tested it myself on a random c file I created in the middle of a rust project I'm working on, it reproduced his full code verbatim from just the function header so clearly it does regurgitate proprietary code unlike some people have said, I do not have his source so co-pilot isn't just using existing context.

I've been finding co-pilot really useful but I'll be pausing it for now, and I'm glad I have only been using it on personal projects and not anything for work. This crosses the line in my head from legal ambiguity to legal "yeah that's gonna have to stop".

>>ianbut+ce
Searching for the function names in his libraries, I'm seeing some 32,000 hits.

I suspect he has a different problem which (thanks to Microsoft) is now a problem he has to care about: his code probably shows up in one or more repos copy-pasted with improper LGPL attribution. There'd be no way for Copilot to know that had happened, and it would have mixed in the code.

(As a side note: understanding why an ML engine outputs a particular result is still an open area of research AFAIK.)

>>shadow+Wf
Well yes, there'd be no way for the copilot model, as currently specified and trained, to know.

But it IS possible to train a model for that. In fact, I believe ML models can be fantastic "code archaeologists", giving us insights into not just direct copying, but inspiration and idioms as well. They don't just have the code, they have commit histories with timestamps.

A causal fact which these models could incorporate, is that we know data from the past wasn't influenced by data from the future. I believe that is a lever to pry open a lot of wondrous discoveries, and I can't wait until a model with this causal assumption is let loose on Spotify's catalog, and we get a computer's perspective on who influenced who.

But in the meantime, discovering where copy-pasted code originated should be a lot easier.

>>vinter+k71
Ah, a plagiarism checker that can understand simple code transformation and find the original source? Sounds like a good idea for patent trolls and I have no idea about how/if copyright laws can be apply in this case. Does copying the idea but not copying the code verbatim constitutes copyright violation?

>>pca006+ro1
The patent troll version of the algorithm needs the victim's bank balance as input too. In fact that's probably all it needs.

It would be much more valuable for people who care about the truth.

zlacker