zlacker

[parent] [thread] 12 comments
1. kmeist+(OP)[view] [source] 2022-10-17 01:44:28
The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

Stable Diffusion actually has a similar problem. Certain terms that directly call up a particular famous painting by name - say, the Mona Lisa[0] - will just produce that painting, possibly tiled on top of itself, and it won't bother with any of the other keywords or phrases you throw at it.

The underlying problem is that the AI just outright forgets that it's supposed to create novel works when you give it anything resembling the training set data. If it were just that the AI could spit out training set data when you ask for it, I wouldn't be concerned[1], but this could also happen inadvertently. This would mean that anyone using Copilot to write production code would be risking copyright liability. Through the AI they have access to the entire training set, and the AI has a habit of accidentally producing output that's substantially similar to it. Those are the two prongs of a copyright infringement claim right there.

[0] For the record I was trying to get it to draw a picture of the Mona Lisa slapping Yoshikage Kira across the cheek

[1] Anyone using an AI system to "launder" creative works is still infringing copyright. AI does not carve a shiny new loophole in the GPL.

replies(4): >>xani_+D3 >>thetea+u5 >>joe-co+D5 >>llimll+p6
2. xani_+D3[view] [source] 2022-10-17 02:16:14
>>kmeist+(OP)
> The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

The reason doesn't really matter...

replies(1): >>lofatd+L5
3. thetea+u5[view] [source] 2022-10-17 02:38:03
>>kmeist+(OP)
> The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

Sounds like MS has devised a massive automated code laundering racket.

replies(1): >>ISL+B7
4. joe-co+D5[view] [source] 2022-10-17 02:39:45
>>kmeist+(OP)
I think that's backwards. The AI doesn't "forget"; it never even knew what novelty was in the first place.
◧◩
5. lofatd+L5[view] [source] [discussion] 2022-10-17 02:41:10
>>xani_+D3
GP is just highlighting why this is so common and why it's such a challenging edge case. If you ask the model for something that's exactly in its dataset, the "best" completion, the one that minimizes loss, is that existing code verbatim. So regurgitation is somewhat intrinsic to applying statistical learning to text completion.
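
To make it concrete, here's a toy sketch (nothing like Copilot's real architecture; the snippets and names are made up) of why an exact-match prompt gets completed with the memorized text:

    # A "model" that has perfectly memorized its training corpus. For any prompt
    # that appears verbatim in the corpus, the loss-minimizing continuation is
    # the training text itself.
    TRAINING_CORPUS = [
        # Hypothetical snippets standing in for scraped public repos.
        "def add_days(ts, days):\n    return ts + days * 86400\n",
        "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    ]

    def complete(prompt: str) -> str:
        """Return the lowest-loss continuation: the memorized text, if the prompt matches it."""
        for doc in TRAINING_CORPUS:
            if doc.startswith(prompt):
                return doc[len(prompt):]  # regurgitation: zero loss on text it has seen
        return " ...something novel...\n"  # fallback for unseen prompts

    # Prompting with text that matches the training set reproduces it verbatim.
    print(complete("def add_days(ts, days):"))

A real model isn't a lookup table, obviously, but on text it has seen many times the training objective rewards exactly the same behavior.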

This means MS really shouldn't have used copyleft code at all, and really shouldn't be selling Copilot in this state, but "luckily" for them, short of a class action suit I don't really see any recourse for the programmers whose work they're reselling.

replies(2): >>fweime+le >>kmeist+0v2
6. llimll+p6[view] [source] 2022-10-17 02:49:33
>>kmeist+(OP)
I tried some very simple queries with Copilot on random stuff, and tried to trace the completions back to their source. I was successful about 1/3 of the time.

(Sorry, I didn't log my experiment results at the time. None of it was related to work I'd done - I used time adjustment functions, if I remember correctly.)

◧◩
7. ISL+B7[view] [source] [discussion] 2022-10-17 03:06:23
>>thetea+u5
Seems more like a massive class-action copyright target, potentially at ($50k/infraction) x (the number of usages).
replies(2): >>bugfix+Jf >>thetea+IC
◧◩◪
8. fweime+le[view] [source] [discussion] 2022-10-17 04:44:42
>>lofatd+L5
Pretty much all code they have requires attribution, and based on reports, Copilot does not generate that along with the code. So excluding copyleft code (how would you even do that?) does not address the issue (assuming that the source code produced is actually a derivative work).
replies(1): >>lofatd+6i
◧◩◪
9. bugfix+Jf[view] [source] [discussion] 2022-10-17 05:03:53
>>ISL+B7
Good. Where do I sign up?
replies(1): >>ISL+k83
◧◩◪◨
10. lofatd+6i[view] [source] [discussion] 2022-10-17 05:46:08
>>fweime+le
That's a good point. I was thinking that during the curation phase of the dataset they should check for a LICENSE.txt file in the repo, and just batch-exclude all repositories containing copyleft/copyrighted code. This obviously won't handle every case, as you say, and when it does generate copyleft code it will still fail to attribute, but hopefully having no copyleft code in its dataset (or at least less of it) reduces the chance it generates code that perfectly satisfies its loss function by being exactly like something it's seen before.
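
Something like this is what I have in mind - purely a sketch (LICENSE.txt as the filename and keyword matching are both simplifications; real license detection would need a proper tool):

    from pathlib import Path

    # Markers for a few well-known copyleft licenses; a real filter would use an
    # actual license-detection tool instead of substring matching.
    COPYLEFT_MARKERS = (
        "GNU GENERAL PUBLIC LICENSE",
        "GNU AFFERO GENERAL PUBLIC LICENSE",
        "GNU LESSER GENERAL PUBLIC LICENSE",
        "MOZILLA PUBLIC LICENSE",
    )

    def looks_copyleft(repo_dir: Path) -> bool:
        """Crude check: treat the repo as copyleft if LICENSE.txt mentions a known copyleft license."""
        license_file = repo_dir / "LICENSE.txt"
        if not license_file.is_file():
            return True  # no recognizable license: safest to exclude as well
        text = license_file.read_text(errors="ignore").upper()
        return any(marker in text for marker in COPYLEFT_MARKERS)

    def curate(repos: list[Path]) -> list[Path]:
        """Keep only repos that pass the (very rough) license filter."""
        return [r for r in repos if not looks_copyleft(r)]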

The main problem I see with generating attribution is that the algorithm obviously doesn't "know" that it's generating identical code. Even in the original Twitter post, the algorithm makes subtle and essentially semantically synonymous changes (like changing the commenting style). So for all intents and purposes it can't attribute the function, because it doesn't know _where_ the code is coming from, and copied code is indistinguishable from de novo code. Copilot will probably never be able to attribute code short of exhaustively checking its outputs against a database of copyleft/copyrighted code with some symbolic approach.
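
The closest practical thing I can imagine is a post-hoc similarity check rather than real attribution - a sketch of the idea (normalize away comments and whitespace so trivial edits don't hide a match, then compare token n-grams against code you've indexed):

    import re

    def tokens(code: str) -> list[str]:
        """Strip comments and split on non-identifier characters to get a crude token stream."""
        code = re.sub(r"//[^\n]*|#[^\n]*|/\*.*?\*/", " ", code, flags=re.DOTALL)
        return [t for t in re.split(r"\W+", code) if t]

    def fingerprints(code: str, n: int = 8) -> set:
        """Overlapping n-token windows, so a changed commenting style or reflowed whitespace doesn't matter."""
        toks = tokens(code)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def overlap(generated: str, known: str) -> float:
        """Fraction of the generated code's fingerprints that also occur in a known source."""
        gen, src = fingerprints(generated), fingerprints(known)
        return len(gen & src) / len(gen) if gen else 0.0

And even that only flags overlap with code you've already indexed; it still can't produce the attribution the license actually requires.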

◧◩◪
11. thetea+IC[view] [source] [discussion] 2022-10-17 09:29:37
>>ISL+B7
Both.
◧◩◪
12. kmeist+0v2[view] [source] [discussion] 2022-10-17 19:47:00
>>lofatd+L5
Suing Microsoft for training Copilot on your code would require clearing the same hurdle that the Authors Guild could not: namely, the finding that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you with a code-generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy, and Microsoft is being kind of irresponsible with how they market it.

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.

◧◩◪◨
13. ISL+k83[view] [source] [discussion] 2022-10-17 23:39:50
>>bugfix+Jf
Find a good and ambitious copyright attorney with some free capacity.

Also, register your code with the copyright office.

Edit: Apparently, with the #1 post on HN right now, you could also just go here: https://githubcopilotinvestigation.com/
