zlacker

[return to "GitHub Copilot, with “public code” blocked, emits my copyrighted code"]
1. _ryanj+2z[view] [source] 2022-10-17 00:51:24
>>davidg+(OP)
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.
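
To make that concrete, here’s a toy sketch of the general idea (this is not our actual prompt construction; the file selection, truncation limits, and the build_prompt helper are all invented for illustration). It shows how an extension might fold snippets from neighboring open files into the prompt, which is why local variable and method names can show up in suggestions:

    # Toy sketch only: how an editor extension might assemble prompt
    # context from neighboring open files (names and limits are invented).
    from pathlib import Path

    def build_prompt(open_files, current_file, cursor_prefix, max_context_chars=4000):
        """Prepend snippets from other open files to the text before the
        cursor, so identifiers from those files can appear in suggestions."""
        parts = []
        budget = max_context_chars
        for path in open_files:
            if path == current_file:
                continue
            snippet = Path(path).read_text(errors="ignore")[:1000]  # truncate each neighbor
            piece = f"# --- context from {Path(path).name} ---\n{snippet}\n"
            if len(piece) > budget:
                break
            parts.append(piece)
            budget -= len(piece)
        return "".join(parts) + cursor_prefix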

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI Codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not that the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (e.g. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns from vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions.

To help keep suggestions unique, Copilot offers a filter that blocks suggestions >150 characters that match public code. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.
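
If you’re curious what a check like that looks like, here’s a deliberately simplified sketch (not the production filter; the exact matching strategy and the public_index structure are invented): suggestions longer than the threshold are dropped when their normalized text is found in an index built from public code.

    # Simplified sketch of a "matches public code" filter (illustrative only).
    THRESHOLD = 150  # characters, as described above

    def normalize(code):
        # Collapse whitespace so trivial formatting changes don't defeat the check.
        return " ".join(code.split())

    def should_block(suggestion, public_index):
        """Block long suggestions whose normalized text appears in an
        index of normalized public code snippets (a set, for this sketch)."""
        if len(suggestion) <= THRESHOLD:
            return False
        return normalize(suggestion) in public_index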

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest takeaway: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse engineering happening in the community, which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

◧◩
2. carom+nQ[view] [source] 2022-10-17 04:16:12
>>_ryanj+2z
Copilot is the greatest disrespect to open source software I have ever seen. It is a derivative work of open source code, yet it is not released under the same license. It is also capable of laundering open source code. Congratulations on working on the "extinguish" phase of embrace, extend, extinguish for open source.
◧◩◪
3. guitar+BW[view] [source] 2022-10-17 05:50:59
>>carom+nQ
The chilling effect of this decision is something everybody who uses open source software should be worried about.
◧◩◪◨
4. Feepin+6v1[view] [source] 2022-10-17 11:49:49
>>guitar+BW
I'm worried that this will harm open source, but in a different way: lots of people switching to unfree "no commercial use at all" licenses, adding special exemptions to licenses, and so on. I'm also worried that it'll harm scientific progress by criminalizing an activity as harmless and commonplace as "learning from open code" simply because it's an AI doing it. And of course it would retard the progress of AI code assistance, a vital component of scaling up programmer productivity.

From an AI safety perspective, I'm also worried it will accelerate the transition to self-learning code, i.e. the model both generating and learning from source code, which is a crucial step on the way to general artificial intelligence that we are not ready for.

◧◩◪◨⬒
5. carom+jL3[view] [source] 2022-10-17 23:33:23
>>Feepin+6v1
Horrible framing. AI is not learning from code. The model is a function. The AI is a derivative work of its training material. They built a program based on open source code and failed to open source it.

They also built a program that outputs open source code without tracking the license.

This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. It is more akin to a human who copied some copyrighted material verbatim.
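
To put it schematically (this is a caricature, not the actual system, and the names here are made up), generation is just one fixed function, baked out of the training corpus, applied over and over:

    # Schematic only: the weights are a fixed artifact derived from the
    # training corpus, and "suggesting code" is repeated application of
    # one function to produce a chain of tokens.
    def generate(weights, prompt_tokens, sample_next, max_tokens=256):
        tokens = list(prompt_tokens)
        for _ in range(max_tokens):
            tokens.append(sample_next(weights, tokens))  # one function of (weights, context)
        return tokens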

◧◩◪◨⬒⬓
6. Feepin+404[view] [source] 2022-10-18 01:26:19
>>carom+jL3
The brain is a function. You're positing a distinction without a difference.