zlacker

[return to "GitHub Copilot, with “public code” blocked, emits my copyrighted code"]
1. kweing+v6 2022-10-16 20:27:21
>>davidg+(OP)
I’ve noticed that people tend to disapprove of AI trained on data from their own profession, but are usually indifferent to, or positive about, other applications of AI.

For example, I know artists who are vehemently against DALL-E, Stable Diffusion, etc. and regard it as stealing, but they view Copilot and GPT-3 as merely useful tools. I also know software devs who are extremely excited about AI art and GPT-3 but are outraged by Copilot.

For myself, I am skeptical of intellectual property in the first place. I say go for it.

2. tpxl+O7 2022-10-16 20:39:26
>>kweing+v6
When Joe Rando plays a song from 1640 on a violin, he gets a copyright claim on YouTube. When Jane Rando uses devtools to check a website’s source code, she gets sued.

When Microsoft steals all code on their platform and sells it, they get lauded. When "Open" AI steals thousands of copyrighted images and sells them, they get lauded.

I am skeptical of imaginary property myself, but fuck this one set of rules for the rich, another set of rules for the masses.

3. rtkwe+Te 2022-10-16 21:45:01
>>tpxl+O7
I think Copilot is a clearer copyright violation than any of the Stable Diffusion projects, though, because code has a much narrower band of expression than images. It's really easy to look at Copilot's output, match it back to the original source, and say these are the same. With Stable Diffusion it's much closer to someone remixing and aping the images than to reproducing originals.
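
As a toy illustration of how mechanical that matching can be (the snippets below are invented for illustration, not actual Copilot output), Python's standard difflib gives a similarity ratio:

    import difflib

    # Hypothetical example: a known public snippet and a model-emitted
    # completion. Both strings are made up for illustration.
    original = "float q_rsqrt(float number) {\n  long i;\n  float x2, y;\n"
    emitted = "float q_rsqrt(float num) {\n  long i;\n  float x2, y;\n"

    # ratio() returns 1.0 for identical text; values near 1.0 flag
    # near-verbatim reproduction worth checking against the source.
    ratio = difflib.SequenceMatcher(None, original, emitted).ratio()
    print(f"similarity: {ratio:.2f}")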

I haven't been following super closely, but I don't know of any claims or examples where input images were recreated to a significant degree by Stable Diffusion.

4. kmeist+5E 2022-10-17 01:44:28
>>rtkwe+Te
The reason it's easy to match Copilot results back to the original source is that users are deliberately starting with prompts that match their own public code in order to trigger regurgitation.
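
A minimal sketch of what that kind of probing looks like, assuming a local code model through the HuggingFace transformers pipeline (the model name and prompt are stand-ins, not what anyone in this thread actually ran):

    from transformers import pipeline

    # Greedy decoding from a causal code model. Feeding the opening
    # lines of code that was in the training set is the usual way to
    # test for memorization: a memorized sample continues verbatim.
    generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")
    prompt = "def levenshtein(a, b):\n    if not a:\n        return len(b)\n"
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    print(out[0]["generated_text"])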

Stable Diffusion actually has a similar problem. Prompts that call up a particular famous painting by name - say, the Mona Lisa[0] - will just produce that painting, possibly tiled on top of itself, and the model won't bother with any of the other keywords or phrases you throw at it.
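
This is easy to probe with the open diffusers library; a rough sketch, assuming a GPU and a public checkpoint (the model id and prompt are only examples):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a public Stable Diffusion checkpoint (example model id).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # A prompt naming a famous, heavily duplicated training image by
    # title; the extra descriptors often get ignored in favor of the
    # memorized composition.
    image = pipe("the Mona Lisa wearing a red baseball cap").images[0]
    image.save("mona_lisa_probe.png")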

The underlying problem is that the AI just outright forgets that it's supposed to create novel works when you give it anything resembling the training-set data. If it were just that the AI could spit out training-set data when you ask for it, I wouldn't be concerned[1], but this can also happen inadvertently. That would mean anyone using Copilot to write production code is risking copyright liability. Through the AI they have access to the entire training set, and the AI has a habit of accidentally producing output that's substantially similar to it. Those are the two prongs of a copyright infringement claim - access and substantial similarity - right there.

[0] For the record, I was trying to get it to draw a picture of the Mona Lisa slapping Yoshikage Kira across the cheek.

[1] Anyone using an AI system to "launder" creative works is still infringing copyright. AI does not carve a shiny new loophole in the GPL.

5. thetea+zJ 2022-10-17 02:38:03
>>kmeist+5E
> The reason it's easy to match Copilot results back to the original source is that users are deliberately starting with prompts that match their own public code in order to trigger regurgitation.

Sounds like MS has devised a massive automated code laundering racket.
