GitHub Copilot, with “public code” blocked, emits my copyrighted code

>>davidg+(OP)
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set-up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

>>_ryanj+2z
This doesn’t at all address the primary issue, which is one of licensing.

Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.

>>svnt+pD
> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If you do something, it's ultimately you who has to make sure that it is not against the law. "I didn't know" is never a good defense. If you pay with counterfeit cash, it is you who will be arrested, even if you didn't know it was counterfeit. If you use code from somewhere else (no matter if it's by copy/pasting or by using Copilot), it is you who has to make certain that it doesn't infringe on any copyright.

Just because a tool can (accidentally) make you break the law, doesn't mean the tool is to blame (cf. BitTorrent, Tor, KaliLinux, ...)

>>dark-s+Q81
BitTorrent doesn't automatically download a pirated copy of Lion King when you ask it for something to watch...

>>unsafe+ba1
BitTorrent (and, to a larger degree, EDonkey) did and still do that. Who tells you that what you're downloading is indeed what you think it is. You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely. To make matters worse, BitTorrent even uploads to potentially hundreds of other clients while you're still downloading, so while downloading something might not be illegal in your jurisdiction, uploading/distributing most certainly is, and you can get into lots of trouble for uploading (parts of a) copyrighted wortk to hundreds or thousands of other users

>>dark-s+Do1
> You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely

This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo.

On the web that sort of thing is actually common, but bit torrent? I have never downloaded a torrent to find it was something other than what I expected. Never have I seen a movie masquerading as a Debian ISO. That's nothing more than a joke people use to make light of their (deliberate) copyright infringement.

Furthermore, is there even any bit torrent client that will recommend copyrighted content to you, rather than merely download what you tell it to? I've not seen one. Search engines, in my browser, do that sort of recommendation but bit torrent clients do what I tell them to. Including seeding to others, which is optional but recommended for obvious reasons.

>>Michae+0v2
> I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo

Sorry, what?

Downloading copyrighted content is very, very rarely the problem.

It's the uploading (the sharing!) of copyrighted content where you actually get into trouble.

>>logifa+TT2
If you actually care, then simply configure your client to leech. Every client I've ever used or heard of supports this.

But more to the point, getting tricked into seeding a copyrighted movie by a torrent masquerading as a Debian ISO isn't something that actually happens. That's absurd FUD.

>>Michae+GV2
Errm, you posted:

> "This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo."

No-one cares whether you download an open-sourced photo of a cat or a copyrighted photo of a dog.

Why would anyone claim that?

It's a terrible comparison to torrents.

zlacker