GitHub Copilot, with “public code” blocked, emits my copyrighted code

>>davidg+(OP)
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set-up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

>>_ryanj+2z
Later in the thread he stated the code was not on the machine he tested copilot with.

Copilot training data should have been sanitized better.

In addition: any code that is produced by copilot that uses a source that is licensed, MUST follow the practices of that license, including copyright headers.

>>binary+uB
Right - but if someone pushes the same code to github and changes the licence file to say "public domain", what's the legally correct way to proceed? What's the morally correct way to proceed?

>>daniel+XP
Legally, if you're publishing a derived work without legitimate permission then you're civilly liable for statutory + actual damages, the only thing you're avoiding is the treble damages for wilful infringement.

Morally I'd say you should make a reasonable good faith effort to verify that you have a real license for everything you're using. When you're importing something on the scale of "all of Github" that means a bit more effort than just blindly trusting the file in the repository. When I worked with an F500 we would have a human explicitly review the license of each dependency; the review was pretty cursory, but it would've been enough to catch someone blatantly ripping off a popular repo.

>>lmm+VU
How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

What if a particular piece of code is licensed restrictively, and then (assuming without malice) accidentally included in a piece of software with a permissive license?

What if a particular piece of code is licensed permissively (in a way that allows relicensing, for example), but then included in a software package with a more restrictive licence. How could you tell if the original code is licensed permissively or not?

At what point do Github have to become absolute arbiters of the original authorship of the code in order to determine who is authorised to issue licenses for the code? How would they do so? How could you prove ownership to Github? What consequences could there be if you were unable to prove ownership?

That's before we even get to more nuanced ethical questions like a human learning to code will inevitably learn from reading code, even if the code they read is not permissively licensed. Why then, would an AI learning to code not be allowed to do the same?

>>d1sxey+u31
The “it’s really hard” argument isn’t a very good argument in my opinion?

If we hold reproductions of a single repository to a certain standard, the same standard should probably apply to mass reproductions. For a single repository, it’s your responsibility to make sure it’s used according to the license.

Are there slightly gray edge cases? Of course, but they’re not -that- grey. If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

If something is prohibitively difficult maybe we should sometimes consider that more work is required to enable the circumstances for it to be a good idea, rather than starting from the position that we should do it and moulding what we consider reasonable around that starting assumption.

>>Ineffa+p61
If someone uploads something and says 'hey, this is some code, this is the appropriate licence for it', it is their mistake, it is in violation of Github's terms of service, and may even be fraudulent. [0].

I'm also not sure that Copilot is just reproducing code, but that's a separate discussion.

> If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

I don't believe that's correct in the first instance (at least from a criminal perspective). If someone misrepresents to you that they have the right to authorise you to publish something, and it turns out they don't have that right, you did not willingly infringe and are not liable for the infringement from a criminal perspective[1]. From a civil perspective, likely the copyright owner could still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive if it's found that

Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

This is my own code, I wrote it myself just now. Can I copyright it?

``` function isOdd (num) { if (num % 2 === 0) { return true; } else { return false; } } ```

What about the following:

``` function isOddAndNotSunday (num) { const date = new Date(); if (num % 2 === 0 && date.getDay() > 0) { return true; } else { return false; } } ```

Where do we draw the line?

[0]: https://docs.github.com/en/site-policy/github-terms/github-t... [1]: https://www.law.cornell.edu/uscode/text/17/506

>>d1sxey+X91
> From a civil perspective, likely the copyright owner could still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive if it's found that

There are statutory damages on top of your actual damages. $50k per act of infringement. No reason for the copyright holder to settle for less when it's an open and shut case.

> Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

Quotes do not automatically get an exception just because they're taken from a larger work, they might be excepted either because they were de minimis (essentially because they were too short to be copyrightable) or because they were fair use (which is a complex question that takes into account the purpose and context, which Copilot is very unlikely to satisfy because it's not quoting other code for the purpose of saying something about it).

> Where do we draw the line?

Circuit specific; some but not all circuits use the AFC test. It sounds like this code was both long enough and creative/innovative enough to be well on the wrong side of it though.

>>lmm+fB1
I am not sure about statutory damages.

As I understand it, the complainant may CHOOSE to request the court to levy statutory damages rather than actual damages at any point, but is not entitled to both actual AND statutory (17 U.S. Code § 504)

It also seems to be absolutely capped at 30K per infringement, not 50, and ranges up from $750. It also seems that if the "court finds, that such infringer was not aware and had no reason to believe that his or her acts constituted an infringement of copyright, the court in its discretion may reduce the award of statutory damages to a sum of not less than $200."

I think you are probably right that this specific function is copyrightable though, but taken overall, I think Microsoft's lawyers have probably concluded that they would win any challenge on this. Microsoft have lost court battles before though, so who knows?

zlacker