GitHub Copilot, with “public code” blocked, emits my copyrighted code

>>davidg+(OP)
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set-up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

>>_ryanj+2z
Hey Ryan! Have you ever done any reading on the Luddites? They weren't the anti technology, anti progress social force people think they were.

They were highly skilled laborers who knew how to operate complex looms. When auto looms came along, factory owners decided they didn't want highly trained, knowledgeable workers they wanted highly disposable workers. The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners. When the factory owners said no, the Luddites smashed the new looms.

Genuinely, and I'm not trying to ask this with any snark, do you view the work you do as similar to the manufacturers of the auto looms? The opportunity to reduce labor but also further the strength of the owner vs the worker? I could see arguments being made both ways and I'm curious about how your thoughts fall.

>>social+rY
What alternative are you suggesting?

Things turned out pretty great economy-wise for people in the UK. So that's a poor example even if Luddites didn't hate technology. Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).

I personally think it'd be rewarding to make developers lives easier, essentially just saving the countless hours we spend googling + copy/pasting Stackoverflow answers.

Co-pilot is merely just one project in this technological development, even if a mega-corp like Microsoft doesn't do it ML is here to stay.

If you're concerned that software developers job security is at all at risk from co-pilot than you greatly misunderstand how software engineering works.

Auto-completing a few functions you'd copy/paste otherwise (or rewrite for the hundredth time) is a small part of building a piece of software. If they struggle with self-driving cars, I think you'll be alright.

At the end-of-the-day there's a big incentive for Github et al to solve this problem, a class action lawsuit is always an overhanging threat. Even if co-pilot doesn't make sense as a business and these pushback shut it down I doubt it will go away.

I'm personally confident the industry will eventually figure out the licensing issues. The industry will develop better automated detection systems and if it requires more explicit flagging, no-one is better positioned to apply that technologically than Github.

>>dmix+l01
The luddites were right. Everything is worse because they did not win, which is obvious if you spend a second thinking about what them not winning means: more concentrated wealth, more disenfranchised workers.

>>krageo+q31
It’s difficult for me to see how 2022 is worse than 1816 all things considered.

>>cdogl+Uh1
https://theweek.com/feature/briefing/1016752/the-real-cost-o...

Maybe not right this moment but our actions have consequences in the future.

For those who only see the next quarter, they're stoked.

For those who understand infinite growth is impossible and would simply like a livable world, they're horrified.

zlacker