If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.
It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI Codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not as if the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To help ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.
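Conceptually (and this is only an illustrative sketch, not the actual implementation; `PublicCodeIndex`, the function names, and the normalization step are placeholders of my own), that filter behaves roughly like this:

```
// Illustrative sketch only (not GitHub's implementation). `PublicCodeIndex`
// stands in for whatever server-side index of public code is consulted.
interface PublicCodeIndex {
  containsVerbatim(snippet: string): boolean;
}

const MATCH_LENGTH_THRESHOLD = 150; // characters, as in the setting described above

// Collapse whitespace so trivial reformatting doesn't dodge the check.
function normalize(code: string): string {
  return code.replace(/\s+/g, " ").trim();
}

// Return the suggestion unchanged, or null if it should be suppressed.
function filterSuggestion(suggestion: string, index: PublicCodeIndex): string | null {
  const normalized = normalize(suggestion);
  if (normalized.length > MATCH_LENGTH_THRESHOLD && index.containsVerbatim(normalized)) {
    return null;
  }
  return suggestion;
}
```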
This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.
Copilot training data should have been sanitized better.
In addition: any code produced by Copilot that draws on a licensed source MUST follow the terms of that license, including copyright headers.
Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
I will admit that I am conflicted, because I can see some really cool potential applications of Copilot, but I can't say I am not concerned if what Tim maintains is accurate for several different reasons.
Let's say Copilot becomes the way of the future. Does that mean we will be able to trust the code more, or less? We already have people who copy-paste from Stack Overflow without trying to understand what the code does. This is a different level, where machine learning suggests the snippet for you. If it works 70% of the time, we will have the new generation of programmers management always wanted.
As I understand, this isn't proven is it?
We don't know that the model isn't simply stitching and approximating back to the closest combination of all the data it saw, versus actually understanding the concepts and logic.
Or is my understanding already behind times?
Given that there have been major concerns about copyright infringements and license violations since the announcement of Copilot, wouldn't it have been better to do some more of this "learning", and determine what responsibilities may be expected of you by the broader community, before unleashing the product into the wild? For example, why not train it on opt-in repositories for a few years first, and iron out the kinks?
All the research suggests that AI-assisted auto-complete merely helps developers go faster with more focus/flow. For example, there's an NYU study that compared security vulnerabilities produced by developers with and without AI-assisted auto-complete. The study found that developers produced the same number of potential vulnerabilities whether they used AI auto-complete or not. In other words, the judgement of the developer was the stronger indicator of code quality.
The bottom line is that your expertise matters. Copilot just frees you up to focus on the more creative work rather than fussing over syntax, boilerplate, etc.
But that is exactly how it works. Translation companies license (or produce) huge corpora of common sentences across multiple languages that are either used directly or fed into a model.
Third party human translators are asked to assign rights to the translation company. https://support.google.com/translate/answer/2534530
Ha ha. Because then the product couldn’t be built. Better to steal now and ask forgiveness later, or better yet, deny the theft ever occurred.
I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is CoPilot accompanying each generation with a URL containing all of the run's weights and traces so that a court can unlock the URL upon court order to investigate copyright infringement.
> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
In general you're not liable for this. While you will still likely have to go to court with the original copyright holder, the damages you pay can be recovered from whoever defrauded you or misrepresented ownership of that work. (I am not your lawyer)
Morally I'd say you should make a reasonable good faith effort to verify that you have a real license for everything you're using. When you're importing something on the scale of "all of Github" that means a bit more effort than just blindly trusting the file in the repository. When I worked with an F500 we would have a human explicitly review the license of each dependency; the review was pretty cursory, but it would've been enough to catch someone blatantly ripping off a popular repo.
> I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem.
Aren't you moving the goalposts? This is not 3 lines; it is a 1:1 reproduction of a complex function that definitely has enough inventive height to be copyrightable.
1. You make it sound as if a translation from e.g. English to Spanish wouldn't fall under copyright. That's incorrect: in most jurisdictions I am aware of, it actually falls under the copyright of the original work and also attracts its own copyright.
2. When will Copilot be released as open source? It is pretty clear by now that it is a derivative of all the OSS code it was trained on, so how about following the licensing?
If someone has lied about the license of something down the chain of links, he's the one on the hook for it.
If you have licensed code in your software and no license to show for it or cannot produce the link to it then you're on the hook.
And here's the issue at hand: Copilot must have seen that code under a permissive license somewhere, but now cannot produce a link to it.
You’re really not going to solve this problem with marketing (“blog posts”) or some pro-Github story from data scientists. You need a DMCA / removal request feature akin to Google image search, and you need to work on understanding product problems from the customer perspective.
They were highly skilled laborers who knew how to operate complex looms. When auto looms came along, factory owners decided they didn't want highly trained, knowledgeable workers they wanted highly disposable workers. The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners. When the factory owners said no, the Luddites smashed the new looms.
Genuinely, and I'm not trying to ask this with any snark, do you view the work you do as similar to the manufacturers of the auto looms? The opportunity to reduce labor but also further the strength of the owner vs the worker? I could see arguments being made both ways and I'm curious about how your thoughts fall.
Thank you for your input.
I'd like you to inspect the issue and explain what happened and why (and start to fix that if that's not intended) rather than sharing what you think could have happened.
Unless you're not in a position to do that, in which case it doesn't matter that you're on the Copilot team (anyone can throw out hypotheses like that).
Please also don't tell me we're at the point where we can't tell why AI works in a particular way and we cannot debug issues like this :-(
The vast majority of people who would use a matrix transform function they got from code completion (or from a GitHub or Stack Overflow search) probably don’t care what the license is. They’ll just paste in the code. To many developers, publicly viewable code is in the public domain. Copilot just shortens the search by a few seconds.
Microsoft should try to do better (I’m not sure how), but the sad fact is that trying to enforce a license on a code fragment is like dropping dollar bills on the sidewalk with a note pinned to them saying “do not buy candy with this dollar”.
Things turned out pretty great economy-wise for people in the UK. So that's a poor example even if Luddites didn't hate technology. Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).
I personally think it'd be rewarding to make developers lives easier, essentially just saving the countless hours we spend googling + copy/pasting Stackoverflow answers.
Co-pilot is merely just one project in this technological development, even if a mega-corp like Microsoft doesn't do it ML is here to stay.
If you're concerned that software developers' job security is at all at risk from co-pilot, then you greatly misunderstand how software engineering works.
Auto-completing a few functions you'd copy/paste otherwise (or rewrite for the hundredth time) is a small part of building a piece of software. If they struggle with self-driving cars, I think you'll be alright.
At the end of the day there's a big incentive for Github et al. to solve this problem; a class action lawsuit is always an overhanging threat. Even if co-pilot doesn't make sense as a business and this pushback shuts it down, I doubt the technology will go away.
I'm personally confident the industry will eventually figure out the licensing issues. The industry will develop better automated detection systems and if it requires more explicit flagging, no-one is better positioned to apply that technologically than Github.
The claim that language models actually understand syntax and semantics is still the subject of significant debate. Look at all the discussion around "horse riding astronaut" prompts for stable diffusion models, and the prompts with geometric shapes, which clearly show that the language model does not semantically understand the prompt.
It doesn't change the licensing issue, but it does mean people are already copying and using copyrighted code without respecting the original license, no AI involved.
There should be a way to reverse engineer code LLMs to see which core bits of memorized code they build on. Another complex option is a combination of provenance tracking and semantic hashing on all functions in code used for training. Another option (non-technical) is a rethinking of IP.
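As a rough sketch of the provenance-tracking plus hashing idea: hash a normalized form of every function in the training corpus and keep a map from hash to origin, so generated output can later be checked against it. The names and the crude comment/whitespace normalization below are made up for illustration; real semantic hashing would more likely operate on an AST or token stream.

```
import { createHash } from "crypto";

// Record where a training function came from, keyed by a normalized hash.
interface Provenance {
  repo: string;
  path: string;
  license: string;
}

const provenanceIndex = new Map<string, Provenance>();

// Crude "semantic" normalization: strip comments and collapse whitespace so
// trivially reformatted copies still hash to the same value.
function fingerprintFunction(source: string): string {
  const normalized = source
    .replace(/\/\/.*$/gm, "")         // line comments
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\s+/g, " ")
    .trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// At training time: index every function seen.
function indexTrainingFunction(source: string, origin: Provenance): void {
  provenanceIndex.set(fingerprintFunction(source), origin);
}

// At generation time: check whether an emitted function matches something indexed.
function lookupProvenance(generated: string): Provenance | undefined {
  return provenanceIndex.get(fingerprintFunction(generated));
}
```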
If CoPilot makes everyone see how ridiculous that is, that's a win in my book.
Instead, they scoured and plagiarized everyone's source code without their consent.
I look forward to the entire product you have made being available, as is required for any product built using gpl3’d software.
The original poster said it was in a private repository.
>It doesn't change the licensing issue, but it does mean people are already copying and using copyrighted code without respecting the original license, no AI involved.
I don't get the argument. Many people are copying/pirating MS Windows/MS Office. What do you think MS would say to a company they caught with unlicensed copies that used the excuse "the PCs came preinstalled with Windows and we didn't check if there was a valid license"?
What if a particular piece of code is licensed restrictively, and then (assuming without malice) accidentally included in a piece of software with a permissive license?
What if a particular piece of code is licensed permissively (in a way that allows relicensing, for example), but then included in a software package with a more restrictive licence. How could you tell if the original code is licensed permissively or not?
At what point do Github have to become absolute arbiters of the original authorship of the code in order to determine who is authorised to issue licenses for the code? How would they do so? How could you prove ownership to Github? What consequences could there be if you were unable to prove ownership?
That's before we even get to more nuanced ethical questions like a human learning to code will inevitably learn from reading code, even if the code they read is not permissively licensed. Why then, would an AI learning to code not be allowed to do the same?
In this case, all you have on them is an email address. Pretty sure you're still on the hook.
There is no “I don’t know who owns the IP” defense: the image has a copyright, a person owns that copyright, and publishing the image without licensing or purchasing the copyright is a violation. The fine is something like $100k per offense for a business.
See https://en.m.wikipedia.org/wiki/Peterloo_Massacre for example
If we hold reproductions of a single repository to a certain standard, the same standard should probably apply to mass reproductions. For a single repository, it’s your responsibility to make sure it’s used according to the license.
Are there slightly gray edge cases? Of course, but they’re not -that- grey. If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.
If something is prohibitively difficult maybe we should sometimes consider that more work is required to enable the circumstances for it to be a good idea, rather than starting from the position that we should do it and moulding what we consider reasonable around that starting assumption.
Sure glad those Luddites didn't get their way
Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license is infringement, GitHub Copilot would be screwed, and at best would have to start over with a blank slate if it can't be grandfathered in. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI stuff traces back to Alphabet, and there are a lot of startups funded by huge and influential VC firms). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.
I think you are vastly underestimating how many professionally employed software developers are replaceable by Copilot at this very moment. Managers have not caught up yet, and you seem to be lucky not to have to work with this type of dev, but I have interacted with 1000s of people in a professional capacity over the decades who could be replaced today. Some of them realised this and moved to different positions (for instance, advising how to use ML to replace them: if you cannot beat them…).
I mean, of course you are right in general, but there are millions of ‘developers’ who just look everything up on Google/SO, copy-paste, and change it until it works. You are saying this will make their lives better; I say it will terminate their employment.
Anecdote: I know a guy who makes a boatload of money in London programming but has no understanding of things like classes, functional constructs, functions, iterators (he kind of, sometimes, understands loops) etc. He simply copies things and changes them until they work. He moved to frontend (React), where he is almost indistinguishable from his more capable colleagues, because they are all in a ‘write code and see the result’ type of mode anyway and all structures look the same in that framework; the skeleton function, useXXX etc. is mostly copy-paste anyway.
This is why I'm gnashing my teeth whenever I hear companies being fine with their employees using Copilot for public-facing code. In terms of liability, this is like going back from package managers to copying code snippets off blogs and forum posts.
If you do something, it's ultimately you who has to make sure that it is not against the law. "I didn't know" is never a good defense. If you pay with counterfeit cash, it is you who will be arrested, even if you didn't know it was counterfeit. If you use code from somewhere else (no matter if it's by copy/pasting or by using Copilot), it is you who has to make certain that it doesn't infringe on any copyright.
Just because a tool can (accidentally) make you break the law, doesn't mean the tool is to blame (cf. BitTorrent, Tor, KaliLinux, ...)
How is that a solution though? OP isn't upset that he's regenerated his own work via Copilot; he's upset that others can do so unknowingly and without attribution.
Music licensing is bonkers but AFAIR (at least in the UK) I think you're allowed to do covers without explicit permission[1] - you'll have to give the original writers/composers the appropriate share of any money you make.
[1] Which is why you (used to?) get, e.g., supermarkets playing covers of songs rather than the originals because it's cheaper.
I'm also not sure that Copilot is just reproducing code, but that's a separate discussion.
> If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.
I don't believe that's correct in the first instance (at least from a criminal perspective). If someone misrepresents to you that they have the right to authorise you to publish something, and it turns out they don't have that right, you did not willingly infringe and are not liable for the infringement from a criminal perspective[1]. From a civil perspective, the copyright owner could likely still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive, if it's found that you infringed unknowingly and in good faith.
Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.
This is my own code, I wrote it myself just now. Can I copyright it?
```
function isOdd (num) {
  if (num % 2 !== 0) {
    return true;
  } else {
    return false;
  }
}
```
What about the following:
```
function isOddAndNotSunday (num) {
  const date = new Date();
  if (num % 2 !== 0 && date.getDay() !== 0) {
    return true;
  } else {
    return false;
  }
}
```
Where do we draw the line?
[0]: https://docs.github.com/en/site-policy/github-terms/github-t... [1]: https://www.law.cornell.edu/uscode/text/17/506
Why this restriction on public-facing code? Are you OK with Copilot being used for "private"/closed source code? I get that it would be less likely to be noticed if the code is not published, but (if I understand right) it is even worse for license reasons.
So I had something similar to the OP happen a couple of days ago. I'm on friendly terms with the developer of a competing codebase and have confirmed the following with them: both mine and theirs are closed source and hosted on Github.
Halfway through building something I was given a block of code by Copilot which contained a copyright line with my competitor's name, company number and email address.
Those details have never, ever been published in a public repository.
How did that happen?
Then they decided to wade in and build a house of cards where the cards are everyone else’s code, just waiting for someone to pull the grenade pin, and we’ve potentially witnessed the moment?
That’s the only thing that makes sense to me here. They don’t care because opening the issue will bring down everyone else with them.
I will admit I’m kind of a “throw stuff at the wall and see what sticks” kind of coder but nobody is paying me boatloads of money to poke at some program until it stops segfaulting, would be nice though.
The first C developers wrote C code despite lacking a training set of C code.
AI can't do that. It needs C code to write C code.
See the difference here?
But also this point is silly. Plenty of money and effort is risked and lost with no bailout. Bailouts are extremely unusual in the grand scheme of things.
Since copilot famously outputs GPL covered code… no, we have proof they didn't do that.
Similarly, Europe putting 30% of its populace "out of work" by industrialising agriculture is why we don't all have to go work in the fields all day. It is a massive net positive for us all.
Moving ice from the Arctic into America quickly enough that it didn't melt was a big industry. The refrigerator put paid to that, and improved lives the world over.
Monks preserved knowledge through careful copying and retransmission during medieval times in the UK. That knowledge was foundational to the incredible acceleration of development in the UK and neighbouring countries in the 18th and 19th centuries. But the printing press, which rendered those monks much less relevant to culture and academia, was still a very good idea that we all still benefit from today.
Soon, millions of car mechanics who specialise in ICE engines will have to retrain or, possibly, just be made redundant. That may be required for us to reduce our pollution output by a few percent globally, and we may well need to do that.
The exact moment in history when workers who've learned how to do one job are rendered obsolete is painful, yes, and they are well within their rights to do what they can to retain a living. But that doesn't mean those workers are somehow right; nor that all subsequent generations should have to delay or forego the life improvement that a useful advance brings, nor all of the advances that would be built on that advance.
I'm concerned that "draw context from" is a euphemism. Does it mean it uses code that's only on your laptop to train its AI?
The simplest answer would be that this is false: it was published somewhere, but you are not aware of it.
This would basically kill github as an idea. I like the ability to be able to push some personal project to github and don't really give a fuck about technical copyright violations and I think the same is true for 90% of developers.
For example, if I copy-pasted code from someone into my open source project, and the copied code was subject to required attribution, will Copilot keep that attribution when it copies my code again?
Say you write some code and release it under the GPL. Then I take your code, integrate it into my project, and release my project under the MIT licence (for example). It may be that Copilot was only trained on my repo (with the MIT licence).
The fault there is not on Github, it's on me. I was the one who incorrectly used your code in a way that does not conform to the terms of your licence to me.
I don't think the fact that Copilot outputs code which seems to be covered by the GPL proves that Github crawled anything beyond permissively licensed repositories when training Copilot.
export default class USERCOMPONENT extends React.Component<IUSER, {}> {
  constructor (oProps: IUSER) {
    super(oProps);
  }

  render() {
    return (
      <div>
        <h1>User Component</h1>
        Hello, <b>{this.props.sName}</b>
        <br/>
        You are <b>{this.props.dwAge} years old</b>
        <br/>
        You live at: <b>{this.props.sAddress}</b>
        <br/>
        You were born: <b>{this.props.oDoB.toDateString()}</b>
      </div>
    );
  }
}

I find it very hard to believe you didn't understand the suggestion.
Github can only trust push timestamps.
IT Crowd Piracy Warning https://www.youtube.com/watch?v=ALZZx1xmAzg
A really good example is mapping objects: let’s say you have a deeply nested object from an ERP and you need to map it to another system (or systems). This is horrible work, and Copilot just generates almost all of it if it knows the input and output objects; it ‘knows’ that address = street, and if it doesn't it will deduce it from the models or comments or both; if there is a separate house number and such, it’ll generate code to translate that too. I used to hire people for that; no longer. It just pops out, I run the tests and fix some things here and there.
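To make that concrete, the kind of mapping code I mean looks roughly like this (the shapes and field names are invented for illustration, not taken from a real ERP):

```
// Hypothetical source/target shapes; real ERP objects are far deeper.
interface ErpCustomer {
  addr: { street: string; houseNo: string; town: string; zip: string };
  birthDate: string; // "YYYY-MM-DD"
}

interface CrmCustomer {
  address: string; // "street houseNo"
  city: string;
  postalCode: string;
  dateOfBirth: Date;
}

// The tedious field-by-field translation that Copilot tends to autocomplete
// once both interfaces are in context.
function mapErpToCrm(src: ErpCustomer): CrmCustomer {
  return {
    address: `${src.addr.street} ${src.addr.houseNo}`.trim(),
    city: src.addr.town,
    postalCode: src.addr.zip,
    dateOfBirth: new Date(src.birthDate),
  };
}
```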
Stealing, scamming, gambling, inheriting, collecting interest, price gouging, slavery, underpaying workers, supporting laws to undermine competitors… Plenty of ways to make money without being useful—or by being actively harmful—to someone else.
> Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.
We don’t need all that clothing, made by monetarily exploiting people in poor countries and sold by emotionally exploiting people in rich countries under the guise of “fashion”. The usefulness line has long been crossed, it’s about profit profit profit.
So you write tests and Copilot generates code that you shove into production with little overhead?
Do you read the code thoroughly (kind of negating having it generated for you), or do you just have blind faith in it because the tests are green and YOLO it into production?
I'd feel pretty uneasy deploying code that:
* I, or a trusted peer, have not written.
* Hasn't been reviewed by my peers.
* I, or my peers, don't understand fairly well.
That's not to say I think my colleagues or I write code that doesn't have problems, but I like to think we at least understand the code we work with, and I believe this has benefits beyond just getting stuff done quickly and cheaply.
In other words, I have no problem using code generated by co-pilot, but I'd feel the need to read and review it quite thoroughly, and then I sort of feel that negates the purpose; it also means it pulls me back into the role of doing work I'd hire someone else to do.
Most highly qualified workers love what they do and will stand for keeping their output quality up. Interchangeable cheap workers, on the contrary, have no real incentive to do that. The factory manager is left alone in charge of balancing quality against cheapness, and the latter comes with obsolescence (planned or not), which is good for business.
The best way to be transparent about a software implementation is to open source the thing. If that's your takeaway, this is the only logical thing to do. Blog posts would be appreciated but are not enough. We can only trust what you say; we cannot verify anything.
Sadly that's probably a modern thing and not something that people wanted / cared about immediately once everyone lost their jobs.
If that is true then one way to get around copyright restrictions on existing code is to create a new language.
Sure, the legal framework can change, but such a profound change will surely have many consequences we won't foresee, for good or bad.
There you have the "most responsible way".
The GPL should be updated to prohibit code from being used for "learning" (i.e., regurgitating copyrighted fragments).
Isn't this basically all UI programming? :D
Joking aside, I see this 'person X doesn't know anything, but they are still delivering' attitude quite a bit on HN now. They clearly know something, and projects like co-pilot will make them even more effective.
I think the opposite of you - that projects like co-pilot will further lower the barriers of entry to programming and expand those who program. I also think that like all ease of programming advances in the past, business requirements will continue to grow at the edges where those who care about the craft will still be required.
Most of the time when it’s made, it’s just papering over yet another situation where a surplus is being squeezed out of a transaction by a parasitic manager class exploiting principal-agent dynamics.
The people who invented this stuff are always trying to tell you they’ve invented the cotton gin or something when in fact they’ve just come up with a clever way to take someone else’s work and exploit it.
only emotionally crippled people like fashion, if they were healthy they would all dress in gray unitards and march in formation towards the glorious future!
hey I too have often been carried away by my own rhetoric but come on!
From an AI safety perspective, I'm also worried it will accelerate the transition to self-learning code, ie. the model both generating and learning from source code, which is a crucial step on the way to general artificial intelligence that we are not ready for.
AI could be used to create languages based on design criteria and constraints like C was, but it does bring up the question of why one of the constraints should be character encodings from human languages if the final generated language would never be used by humans...
I mainly think it's funny watching all of these Rand'ian objectivists reusing every excuse used by every craftsman that was excised from working life... machines need a machinist, they don't have souls or creativity, etc.
Industry always saw open source as a way to cut cost. ML trained from open source has the capability to eliminate a giant sink of labor cost. They will use it to do so. Then they will use all of the arguments that people have parroted on this site for years to excuse it.
I'm a pessimist about the outcomes of this and other trends along with any potential responses to them.
I think the real lesson to learn is if you look at the sheer amount of energy (wattage) used to replace humans it's clear that brains are really calorie efficient at doing things like producing the kinds of code that Copilot creates...but it doesn't matter because eliminating labor cost will always be attractive no matter what the up front cost is to do it. They literally can't NOT do it based on the rules of our game.
If it wasn't MS it would be someone else and is...you think IBM isn't doing this? Amazon? GTFOH. So is every other large company that has a pool of labor that is valued as a cost.
Maybe a better question would be how and why major parts of human life are organized in ways that are bad for the bulk of humanity.
Being a new area of development doesn't release you from your obligation to make sure what you're doing is ethical and legal FIRST.
> I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.
And yet, oddly, nowhere did the phrase "I reached out to OP to discuss this with them" appear in your response. Nope. Being part of GitHub's infamous social media "incident response" team was more important than actually figuring out what was going on.
You don't even say that you will look into the situation with OP, or speak to them.
waves to all the github employees who will be reading this comment because someone on Github's marketing team links to it
First consider that you made a mistake yourself, _then_ ask whether the fault could be on the other side. I really dislike this high-horse, down-talking tone. Maybe it was not meant to sound like that; maybe this kind of talk has become a habit without you noticing. Let's assume that, giving the benefit of the doubt.
Onto the actual matter:
> If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.
How come Copilot hasn't indicated where the code came from? How can it ever seem like the code came from elsewhere? That is the actual question. We still need Copilot to point us to repositories or snippets on Github when it suggests copies of code (including copies that merely rename variables). Otherwise the human is taken out of the loop and no one is checking for copyright infringements and license violations. This has been requested for a long time. It is time for Copilot to actually respect the rights of developers and users of software.
> It’s also possible that your code – or very similar code – appears many times over in public repositories.
So basically it propagates license violations. Great. Like I said, the human needs to be kept in the loop and Copilot needs to empower the user to check where the code came from.
> This is a new area of development, and we’re all learning.
The problem is not that this is a new development or that we are all learning. That is fine; sure, we all need to learn. However, when there is clearly a problem with how Copilot works, it is the responsibility of the Copilot development team to halt any further violations and fix that problem first, before letting the train roll on and violating more people's rights. The way this is being handled, by just shrugging and rolling on and maybe fixing things at some point, is simply not acceptable.
[0] https://www.statista.com/statistics/817918/number-of-busines...
I have lower expectations of the rigor with which companies police their internal codebases, though. Seeing Copilot banned for internal use too is a pleasant surprise. Companies tend to be a lot more "liberal" in what kind of legal liabilities they accept for their internal tooling in my experience.
This claim rings extremely hollow when your team refuses to do any of the obvious things that developers, experts and community stakeholders in this very thread (and the rest of this website) are telling you. You still haven't open-sourced Copilot. You still haven't trained it on Microsoft internal code such as Windows and Office. You still haven't made the model freely available for anyone to run locally. Until you do any of these things, you are not acting in the interest of the community and you are just exploiting people and their code for your own profit.
Is this the tack your organization would take if someone else’s code completion software was generating Microsoft’s proprietary code?
There are statutory damages on top of your actual damages. $50k per act of infringement. No reason for the copyright holder to settle for less when it's an open and shut case.
> Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.
Quotes do not automatically get an exception just because they're taken from a larger work, they might be excepted either because they were de minimis (essentially because they were too short to be copyrightable) or because they were fair use (which is a complex question that takes into account the purpose and context, which Copilot is very unlikely to satisfy because it's not quoting other code for the purpose of saying something about it).
> Where do we draw the line?
Circuit specific; some but not all circuits use the AFC test. It sounds like this code was both long enough and creative/innovative enough to be well on the wrong side of it though.
It’s not a binary of doing everything perfectly or nothing at all. The law looks at intent, and so doesn’t punish mistakes or errors so long as you aren’t being malicious, reckless, or negligent.
They should definitely include disclaimers and make seeding opt-in (though I don't know how safe you are legally when you download a Lion King copy labeled Debian.iso). That said, they don't have the information necessary to tell whether what you're doing is legal or not.
Copilot _has_ that information. The model spits out code that it read. They could disallow publishing or commercially using code generated by it while they're sorting it out, but they made the decision not to.
AI is hard, but the model is clearly handing out literal copies of GPL code. Github knows this and they still don't tell you about it when you click install.
As I understand it, the complainant may CHOOSE to request the court to levy statutory damages rather than actual damages at any point, but is not entitled to both actual AND statutory (17 U.S. Code § 504)
It also seems to be capped at an absolute maximum of $30k per infringement, not $50k, and ranges up from $750. It also seems that if the "court finds, that such infringer was not aware and had no reason to believe that his or her acts constituted an infringement of copyright, the court in its discretion may reduce the award of statutory damages to a sum of not less than $200."
I think you are probably right that this specific function is copyrightable though, but taken overall, I think Microsoft's lawyers have probably concluded that they would win any challenge on this. Microsoft have lost court battles before though, so who knows?
But they ain't some kind of special villains; it's today's monopoly market kicking in. Selling startups to Yahoo comes with consequences.
> capable of laundering open source code

That's an exaggeration. Copilot is still a dumb machine which accidentally learned to mimic the practice of borrowing intellectual property from human coders.
I don't equate, say, "making money" with "stealing money". I mean the way people do things within the law. Inheriting is different; the money is already made. Interest is being useful to someone else, via the loan of capital.
The problem isn't even that this technology will eventually replace programmers: the problem is that it reproduces parts of the training set VERBATIM, sans copyright notice.
No, I am pretty optimistic that we will quickly come to a solution when we start using this to void all microsoft/github copyright.
No, that's not true. Capitalists make money from simply owning things, not because they're necessarily doing anything useful.
How has your team defined, specified and clearly articulated these issues with generation?
How do you test your generation to distinguish between fixing a problem vs reducing obvious true positives (i.e. unintentionally making the problem less visible without eliminating it)?
Without some communication on those fronts (which maybe I've just not seen yet), I'm not surprised that you get pushback against your product from people who feel like you're taking a cover-our-ass-and-YOLO approach.
Maybe not right this moment but our actions have consequences in the future.
For those who only see the next quarter, they're stoked.
For those who understand infinite growth is impossible and would simply like a livable world, they're horrified.
Currently, everything is extraction and the US is rotting from the inside out because of it.
Laws shouldn't be equated to ethics. There have been and will be countless ways to make money legally and unethically in any society.
This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo.
On the web that sort of thing is actually common, but bit torrent? I have never downloaded a torrent to find it was something other than what I expected. Never have I seen a movie masquerading as a Debian ISO. That's nothing more than a joke people use to make light of their (deliberate) copyright infringement.
Furthermore, is there even any bit torrent client that will recommend copyrighted content to you, rather than merely download what you tell it to? I've not seen one. Search engines, in my browser, do that sort of recommendation but bit torrent clients do what I tell them to. Including seeding to others, which is optional but recommended for obvious reasons.
[1] Things like "was it on the radio or a TV show or a live performance or a recording? who was the composer? which licensing region was it in?" etc.
Proposition: "They don't use private code".
Proof: "They said they don't use private code. Either the private code appearing is published somewhere else, or they are using private code. Lying would be bad. Therefore the code is published somewhere else, and they don't use private code".
Did you read the comment you're replying to at all? It says
>The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners.
Now maybe you agree maybe you disagree. But if you're just talking past the person you're replying to... what's the point?
In other words: things improved because of technology and despite the societal/economic framework, not because of it.
And how many workers even have the possibility of an arrangement like this, i.e. a worker-owned cooperative?
Yes, that is exactly the point. When a labour-saving technological development comes along, it's payday to the capital-having class and dreary times for the labour-doing class.
>hey I too have often been carried away by my own rhetoric but come on!
Because that's what people want. You can get high quality clothes for much cheaper than you could in 1816, but people prefer disposable clothes so they can change their look more often. This is just producers responding to demand.
Sorry, what?
Downloading copyrighted content is very, very rarely the problem.
It's the uploading (the sharing!) of copyrighted content where you actually get into trouble.
But more to the point, getting tricked into seeding a copyrighted movie by a torrent masquerading as a Debian ISO isn't something that actually happens. That's absurd FUD.
If your hope is that saying "it came out of our ML model" somehow removes Copilot from the well-established legal framework of licensing, I think you're wrong, and you are creating a minefield that I and others choose to stay well clear of. The revenue from Copilot, and the rest of MS, can probably pay your legal bills, but certainly not mine.
J. Random Hacker acquires and uses a copy of some of GitHub's, or Microsoft's source. When sued, the defense says that the code was not taken directly from GH/MS, just copied from a newsgroup where it had been posted. Does this get J. off the hook?
As a thought experiment, if one were to train a model on purely leaked and/or stolen source code, would the use of model step effectively "launder" the code and make later partial reuse legit?
> "This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo."
No-one cares whether you download an open-sourced photo of a cat or a copyrighted photo of a dog.
Why would anyone claim that?
It's a terrible comparison to torrents.
Please don’t straw man¹. That’s neither what I said, nor what intended to convey, nor what I believe.
If this can leak so easily, it makes me wonder how safe API keys are. They are supposed to be hidden away, we know, but so is proprietary code.
The examples considered that: gambling, collecting interest, price gouging, underpaying workers, supporting laws to undermine competitors.
I’m not saying they’re intentionally lying, but one possible explanation is that it is looking through non-public repositories.
They also built a program that outputs open source code without tracking the license.
This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. This is more akin to a human who copied some copyrighted material verbatim.
> No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider
So the act of hosting copyrighted content is not actually a copyright violation for Github. They're not obligated to preemptively determine who the original copyright owner of some piece of code is, as they're not the judge of that in the first place. Even if you complain that someone stole your code, how is Github supposed to know who's lying? Copyright is a legal issue between the copyright holder and the copyright infringer. So the only thing Github is required to do is to respond to DMCA takedown notices.
A car has all the information that it's going faster than the speed limit, or that it just ran a red light. But in the end it's the driver who is responsible. It's not the tool (car, Copilot) that commits the illegal act, it's the user using that tool
I can't say for sure about copilot but in general you don't have that kind of information. The problem is a bit like trying to add debug symbols back to some highly optimized binary program.
I'm from the UK, and we used to make motorbikes. They got - correctly - outcompeted by Japanese bikes in the 1950s that were built with more modern investment and tooling. If Japan hadn't done that, we'd have more motorcycle jobs in the UK, and terrible motorcycles that still leaked oil because the seam of the crankcase would still be vertical and not horizontal.
I'm not saying anything about this process is perfect and pain-free, but it seems that a lot of the things we have now are because of processes like this. Should Tesla sell through dealerships instead of direct to consumers? I think the answer is, "Tesla should do what's best for its customers", and not "Tesla should act to keep dealership jobs and not worry about what's best for its customers."
Businesses exist for their customers and not their employees, and having just been part of a business that, shall we say, radically downsized, I've seen a little of the pain of that. Thankfully it was a high tech business, and as the best employment protection is other employers, and there are loads of employers wanting tech skills I've seen my great colleagues all get new jobs. But I think it's ultimately disempowering to think of your employer like a superior when it should feel like an equal whose goals happen to coincide with yours for a while.
Can you elaborate on this? How can I become a capitalist so all my possessions start earning me money?
Proposition: "They either do not use private code or they did something very very stupid."
Proof: "Not using private code is very easy (for example google does not train its models on workspace users' data, which is why they get inferior features) and they promised multiple time not to use private code so doing in would be hard to justify"
Like I said: it is a great thing for me, but I don't believe developers without talent and/or rigorous foundations will make it. Go on Upwork and try to find someone who can do more than the same work (mostly copy-paste) that they have always done. In an interview, when you ask someone to use map/reduce to create a map/dict, they will glaze over. This is the norm, not the exception, no matter the pay. Some of them have 10 years of experience but cannot do anything other than make CRUD pages. This will end as Copilot makes lovely .reduce and LINQ art from a human-language prompt.
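For reference, the interview task is about a snippet roughly like this (illustrative only; the type and data are made up):

```
interface User { id: string; name: string }

const users: User[] = [
  { id: "a1", name: "Ada" },
  { id: "b2", name: "Brian" },
];

// Build a dict keyed by id with reduce: the kind of one-liner that, in my
// experience, many candidates cannot write without copying it from somewhere.
const usersById = users.reduce<Record<string, User>>((acc, user) => {
  acc[user.id] = user;
  return acc;
}, {});
```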
Capitalists make money from simply owning things, but that doesn't imply in the slightest that everything that can be owned produces income.
The classic example is a landlord: he collects income because he simply owns the land others need or want to use. He doesn't necessarily have do any work that's useful to anyone else, not even maintenance or "capital allocation."
Genuine question, not being snarky.
It is still your responsibility to know and obey the traffic laws, the same as it is your responsibility to obey the copyright laws....
Gambling - I don't do it, but I'd need more specifics to see why gambling is bad in this sense. It's a voluntary pursuit that I think is a bad idea, but that doesn't make it illegal.
Price gouging is still being useful, just at a higher price. Someone could charge me £10 for bread and if that was the cheapest bread available, I'd buy it. If it is excessive and for essential goods, it is increasingly illegal, however. 42 out of 50 states in the US have anti-gouging laws [0], which, as I say, isn't what I'm talking about. I'm talking about legal things.
Underpaying workers - this certainly isn't illegal, unless it's below minimum wage, but also "underpaying" is an arbitrary term. If there's a regulatory/legal/corrupt state environment in which it's hard to create competitors to established businesses, then that's bad because it drives wages down. Otherwise, wages are set by what both the worker and employer sides will bear. And, lest we forget, there is still money coming into the business by it being useful. Customers are paying it for something. The fact that it might make less profit by paying more doesn't undermine that fundamental fact.
As for supporting laws to undermine competitors, that is something people can do, yes. Microsoft, after their app store went nowhere, came out against Apple and Google charging 30% for apps. Probably more of a PR move than a legal one, but businesses trying to influence laws isn't bad, because they have a valid perspective on the world just as we all do, unless it's corruption. Which is (once more, with feeling) illegal, and so out of scope of my comment. And again, unless the laws are there to establish a monopoly incumbent, which is pretty rare, and definitely the fault of the government that passes the laws, the company is still only really in existence because it does something useful enough to its customers that they pay it money.