zlacker

[return to "GitHub Copilot, with “public code” blocked, emits my copyrighted code"]
1. _ryanj+2z[view] [source] 2022-10-17 00:51:24
>>davidg+(OP)
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.
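
Roughly, you can picture that context gathering like this (a simplified sketch, not the actual implementation; all the names here are made up):

    # Simplified sketch of neighboring-file context. Purely illustrative;
    # real prompt assembly is more involved, and these names are invented.
    def build_prompt(open_files: dict[str, str], current_path: str,
                     code_before_cursor: str) -> str:
        # Take a truncated snippet from every *other* file open in the editor.
        neighbors = [
            f"# Path: {path}\n{text[:1000]}"
            for path, text in open_files.items()
            if path != current_path
        ]
        # Prepend those snippets to the code before the cursor, so the model
        # sees your project's own variable and method names as context.
        return "\n\n".join(neighbors + [code_before_cursor])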

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI Codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (e.g. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns from vast amounts of public data, and when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions.

To help keep suggestions unique, Copilot offers a filter that blocks suggestions of more than 150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in your user settings.
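
If you want an intuition for how a filter like that can work, here is a toy version (again, just a sketch that assumes a pre-built set of hashes of public code fragments; it is not our production code):

    import hashlib

    WINDOW = 150  # block suggestions with a >150-character public match

    def fragment_hashes(text: str) -> set[str]:
        # Hash every WINDOW-character slice of whitespace-normalized text.
        normalized = " ".join(text.split())
        return {
            hashlib.sha256(normalized[i:i + WINDOW].encode()).hexdigest()
            for i in range(max(1, len(normalized) - WINDOW + 1))
        }

    def should_block(suggestion: str, public_hashes: set[str]) -> bool:
        # Suppress a suggestion if any slice of it matches a fragment
        # seen in the public training data.
        if len(suggestion) <= WINDOW:
            return False
        return not fragment_hashes(suggestion).isdisjoint(public_hashes)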

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest takeaway: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community, which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

◧◩
2. esskay+oc1[view] [source] 2022-10-17 08:44:25
>>_ryanj+2z
Hi Ryan, thanks for posting here.

So I had something similar to the OP happen a couple of days ago. I'm on friendly terms with the developer of a competing product and have confirmed the following with them: both my codebase and theirs are closed source and hosted on GitHub.

Halfway through building something, Copilot gave me a block of code which contained a copyright line with my competitor's name, company number, and email address.

Those details have never, ever been published in a public repository.

How did that happen?

◧◩◪
3. elcome+wg1[view] [source] 2022-10-17 09:26:23
>>esskay+oc1
> Those details have never, ever been published in a public repository.

The simplest answer would be that this is false: it was published somewhere, but you are not aware of it.

◧◩◪◨
4. grecy+ck1[view] [source] 2022-10-17 10:14:06
>>elcome+wg1
An equally simple answer is that Copilot is pulling (or at least analyzing) code from repositories that are not public.
◧◩◪◨⬒
5. elcome+BH1[view] [source] 2022-10-17 13:19:34
>>grecy+ck1
I think that's very unlikely; they have said repeatedly that they are not using private code. Being caught lying about this would be very bad for GitHub.
◧◩◪◨⬒⬓
6. andrep+HI2[view] [source] 2022-10-17 17:33:23
>>elcome+BH1
This is some highly impressive logic right here.

Proposition: "They don't use private code".

Proof: "They said they don't use private code. Either the private code appearing is published somewhere else, or they are using private code. Lying would be bad. Therefore the code is published somewhere else, and they don't use private code".

◧◩◪◨⬒⬓⬔
7. afiori+rO4[view] [source] 2022-10-18 09:58:22
>>andrep+HI2
I would say that the logic is more like:

Proposition: "They either do not use private code or they did something very very stupid."

Proof: "Not using private code is very easy (for example google does not train its models on workspace users' data, which is why they get inferior features) and they promised multiple time not to use private code so doing in would be hard to justify"
