First, as much as I don't like the idea of Copilot, it seems to be good for boilerplate code. However, the fact that boilerplate code exists is not because of some natural limitation of code; it exists because our programming languages are subpar at making good abstractions.
Here's an example: in Go, there is a lot of `if err != nil` error-handling boilerplate. Rust decided to make a better abstraction and shortened it to `?`.
(I could have gotten details wrong, but I think the point still stands.)
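To make the comparison concrete, here is a minimal Rust sketch (the function and its purpose are hypothetical, just for illustration): the `match` version has essentially the same shape as Go's `if err != nil { return "", err }` at every call site, and `?` collapses it to one character.

    use std::fs;
    use std::io;

    // The long way: the same shape as Go's `if err != nil { return "", err }`.
    fn read_config_verbose(path: &str) -> Result<String, io::Error> {
        let contents = match fs::read_to_string(path) {
            Ok(c) => c,
            Err(e) => return Err(e),
        };
        Ok(contents)
    }

    // The same function using the `?` abstraction.
    fn read_config(path: &str) -> Result<String, io::Error> {
        Ok(fs::read_to_string(path)?)
    }

Every line the `match` version spends on error plumbing is exactly the kind of line Copilot would autocomplete; with `?`, there is nothing left to autocomplete.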
So I think a better way to solve the problem Copilot solves is better programming languages that give us better abstractions.
Second, I personally think the legal justifications for Copilot are dubious at best and downright deceptive at worst, to say nothing of its ramifications. I wrote a whitepaper refuting those justifications and exploring the ramifications. [1]
(Note: the whitepaper was written quickly to hit a deadline, so it's not my best work. There's an intro blog post at [2].)
I'm also working on licenses to clarify the legal arguments against Copilot. [3]
I also hope that one of them [4] turns out to be a better license than the AGPL: one without the virality, and applicable to more cases.
Edit: Do NOT use any of those licenses yet! I have not had a lawyer check and fix them. I plan to do so soon.
[1]: https://gavinhoward.com/uploads/copilot.pdf
[2]: https://gavinhoward.com/2021/10/my-whitepaper-about-github-c...
The fast inverse square root algorithm referenced here didn't originate with Quake, and it appears in hundreds of repositories, many under permissive licenses like the WTFPL and many including the same comments. It's not really a large amount of material, either.
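For a sense of scale, the whole algorithm fits in a few lines. Here is a sketch in Rust using the same bit trick and magic constant as the widely copied C version (an illustration, not the verbatim Quake code):

    // Approximates 1/sqrt(x) for positive x using the well-known bit trick
    // and one Newton-Raphson refinement step.
    fn fast_inv_sqrt(x: f32) -> f32 {
        let half = 0.5 * x;
        let i = x.to_bits();            // reinterpret the float's bits as a u32
        let i = 0x5f37_59df - (i >> 1); // initial guess via the magic constant
        let y = f32::from_bits(i);
        y * (1.5 - half * y * y)        // one Newton step to refine the guess
    }

That's the entire thing: a bit cast, one subtraction, and one refinement step.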
GitHub claims they haven't found any "recitations" that appeared fewer than 10 times in the training data. That doesn't mean the issue is completely solved, though, since some code may appear in many repositories yet always under non-permissive licenses.
> and I would argue that it will not be the case for ML models in general because all ML models like Copilot will keep suggesting output as long as you ask for it. There is no limit to how much output someone can request. In other words, it is trivial to make such models output a substantial portion of the source code they were trained on.
With the exceptions mentioned above, asking for more code won't just get you more and more of a particular work. Realistically, I think you could extract significantly more of a given work from Google Books.