First, as much as I don't like the idea of Copilot, it does seem to be good for boilerplate code. However, boilerplate doesn't exist because of some natural limitation of code; it exists because our programming languages are subpar at making good abstractions.
Here's an example: in Go, there is a lot of `if err != nil` error-handling boilerplate. Rust decided to make a better abstraction and shortened it to `?`.
(I could have gotten details wrong, but I think the point still stands.)
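To make the contrast concrete, here's a rough sketch in Rust (the function and file names are made up for illustration). The first version hand-writes the error propagation, which is roughly the shape of Go's `if err != nil` boilerplate; the second uses `?`:

    use std::fs;
    use std::io;

    // Hand-written propagation: roughly the shape of Go's
    // `if err != nil { return "", err }` boilerplate.
    fn read_config_verbose(path: &str) -> Result<String, io::Error> {
        match fs::read_to_string(path) {
            Ok(contents) => Ok(contents),
            Err(e) => Err(e),
        }
    }

    // The same function with the `?` abstraction; the plumbing disappears.
    fn read_config(path: &str) -> Result<String, io::Error> {
        Ok(fs::read_to_string(path)?)
    }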
So I think a better way to solve the problem Copilot solves is with programming languages that give us better abstractions.
Second, I personally think the legal justifications for Copilot are dubious at best and downright deceptive at worst, to say nothing of its ramifications. I wrote a whitepaper refuting the justifications and laying out the ramifications. [1]
(Note: the whitepaper was written quickly, to hit a deadline, so it's not the best. Intro blog post at [2].)
I'm also working on licenses to clarify the legal arguments against Copilot. [3]
I also hope that one of them [4] will be a better license than the AGPL: one without the virality and applicable to more cases.
Edit: Do NOT use any of those licenses yet! I have not had a lawyer check and fix them. I plan to do so soon.
[1]: https://gavinhoward.com/uploads/copilot.pdf
[2]: https://gavinhoward.com/2021/10/my-whitepaper-about-github-c...
The fast inverse square root algorithm referenced here didn't originate in Quake and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments. It's not really a large amount of material, either.
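(For reference, the routine in question is only a handful of lines. Here's a rough rendering of the trick in Rust; the folk versions circulate as C, and the names here are mine:)

    // The classic fast inverse square root, rendered in Rust.
    fn fast_inv_sqrt(x: f32) -> f32 {
        let i = 0x5f3759df_u32 - (x.to_bits() >> 1); // the famous magic constant
        let y = f32::from_bits(i);
        y * (1.5 - 0.5 * x * y * y) // one Newton-Raphson refinement step
    }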
GitHub claims they haven't found any "recitations" that appeared fewer than 10 times in the training data. That doesn't mean it's a completely solved issue though, since some code may be in many repositories yet always under non-permissive licenses.
> and I would argue that it will not be the case for ML models in general because all ML models like Copilot will keep suggesting output as long as you ask for it. There is no limit to how much output someone can request. In other words, it is trivial to make such models output a substantial portion of the source code they were trained on.
With the exceptions mentioned above, what you get back from asking for more code won't just be more and more of a particular work. Realistically, I think you'd be able to extract significantly more of a given work from Google Books.
Where did it come from then? And what license did the original have?
> and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments.
If the original was GPL or proprietary, then all of these copies with different licenses are violating the license of the original. Just because it exists everywhere does not mean Copilot can use it without violating the original license.
> It's not really a large amount of material, either.
No, but I would argue that it is enough for copyright because it is original.
> GitHub claims they haven't found any "recitations" that appeared fewer than 10 times in the training data.
Key word is "claim". We can test that claim. Or rather, you can, if you have access to Copilot: try the test I suggested at https://news.ycombinator.com/item?id=28018816 and let me know the result. Even better, try it with:
    // Computes the index of the item.
    map_index(
because what's in that function is definitely copyrightable.

> With the exceptions mentioned above, what you get back from asking for more code won't just be more and more of a particular work. Realistically, I think you'd be able to extract significantly more of a given work from Google Books.
That can only be tested with time. Or with the test I gave above.
I think that with time, more and more examples will appear until it is clear that Copilot is a problem.
Nevertheless, a court (I think in the US) recently ruled that an AI cannot be an inventor. If an AI cannot be an inventor, why can it hold copyright? And if it can't hold copyright, then the copyright in whatever it outputs must still belong to the original authors, so I argue its output is infringing.
Again, only time will tell which of us is correct according to the courts, but I intend to demonstrate to them that I am.
From what I read, the code has been altered and iterated on as it was passed down. The magic number constant is claimed to have been derived by Cleve Moler and Gregory Walsh.
> If the original was GPL or proprietary, then all of these copies with different licenses are violating the license of the original. Just because it exists everywhere does not mean Copilot can use it without violating the original license.
If it was originally proprietary (this predates GPL) I believe the liability would be on whoever took that proprietary code and republished it under MIT/etc.
To be clear, I'm not recommending that you use code you know has been incorrectly licensed. Just that in cases where certain "folk code" is seemingly widely available under permissive terms, Copilot isn't doing much that an honest human wouldn't.
> Key word is "claim". We can test that claim. Or rather, you can, if you have access to Copilot
I don't, unfortunately. As a side note, your function already existed in Apache-licensed code. But since it's not in many repositories, I'd be willing to bet Copilot won't regurgitate it. I could message a few people who might be able to try it.
> Nevertheless, a court (I think in the US) recently ruled that an AI cannot be an inventor. If an AI cannot be an inventor, why can it hold copyright?
GitHub's intention isn't for Copilot to hold the code's copyright, but for the user to.