If someone came to you and said "good news: I memorized the code of all the open source projects in this space, and can regurgitate it on command", you would be smart to ban them from working on code at your company.
But with "AI", we make up a bunch of rationalizations. ("I'm doing AI agentic generative AI workflow boilerplate 10x gettin it done AI did I say AI yet!")
And we pretend the person never said that they're just loosely laundering GPL and other code in a way that rightly would be existentially toxic to an IP-based company.
The reality is that programmers are going to see other programmers' code.
I don't think anyone who's not monetarily incentivized to pretend there are IP/copyright issues actually thinks there are. Luckily everyone is for the most part just ignoring them, and the legal system is working well and not allowing them an inch to stop progress.
Content on StackOverflow is under CC BY-SA; the version depends on the date it was submitted: https://stackoverflow.com/help/licensing . (It's really unfortunate that they didn't pick a license compatible with code; at one point they started to move to the MIT license for code, but then didn't follow through on it.)
Sure, it's a big hill to climb to rethink IP laws so they align with a societal desire that generating IP remain a viable economic work product, but that is what's necessary.
Why do you think that about people who disagree with you? You're responding directly to someone who's said they think there are issues, and they're not pretending. Do you think they're lying? Did you not read what they said?
And AFAICT a lot of other people think similarly to me.
The perverse incentives to rationalize are on the side of the people looking to exploit the confusion, not the people who are saying "wait a minute, what you're actually doing is..."
So a gold rush person claiming opponents must be pretending because of incentives... seems like the category of "every accusation is a confession".
This is far from settled law. Let's not mischaracterize it.
Even so, an AI regurgitating proprietary code that's licensed in some other way is a very real risk.
So. Yes, technically possible. But impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively matrices of weights used to do math that predicts the most likely next token (read: typically a few characters of text) given some input.
So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
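To make that concrete, here's a toy sketch of a single next-token step, with every value invented for illustration: one matrix multiplication producing scores over a vocabulary, then a softmax.

    import numpy as np

    # Toy next-token step; all numbers here are made-up stand-ins.
    vocab = ["def", " main", "(", ")", ":"]
    hidden = np.random.rand(16)              # stand-in context representation
    W_out = np.random.rand(len(vocab), 16)   # stand-in output projection

    logits = W_out @ hidden                        # one matrix multiplication
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> distribution
    print(vocab[int(np.argmax(probs))], probs.round(3))
    # No training text is stored anywhere above; the "knowledge" lives in the
    # weights, which were fit so that likely continuations score highly.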
Finally, let's say that it is possible for the model to spit out not entire works, but a handful of lines of code that appear in some codebase.
This does not constitute copyright infringement, as the lines in question (a) represent a tiny portion of the whole work (and copyright only protects against the reduplication of whole works or significant portions of a work), and (b) there are a limited number of ways to accomplish a certain function, and it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation of a part of a work (which is what this case would be) is no more illegal than the use of a certain chord progression, melodic phrasing, or drum rhythm. Courts have ruled on this thoroughly.
If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
AI is training on private GitHub repos and coughing them up. I've had it regurgitate a very well-written piece of code for a particular computational geometry algorithm. It presented perfect, idiomatic Python with perfect tests that caught all the degenerate cases. That was obviously proprietary code; no amount of searching came up with anything even remotely close (which is why I asked the AI in the first place).
They can have a moral view that AI is "stealing" but they are claiming there is actually a legal issue at play.
Not for a dozen lines here or there, even if they could be found and identified in a massive code base. That's like quoting a paragraph of a book in another book: non-infringing.
For the second half of your comment, it sounds like you're saying you got results that were too good to be AI; that's a bit "no true Scotsman", at least without more detail. But implementing an algorithm, even a complex one, is very much something an LLM can do. Algorithms are much better defined and scoped than general natural language, and LLMs do a reasonable job of translating natural language into programming languages. An algorithm is a narrow subset of that task type, with better-defined context and syntax.
You're certainly correct. It's also true that companies are going to sue over it. There's no reason to make yourself an easy lawsuit target, if it's trivial to avoid it.
Judge Alsup, in his ruling, specifically likened the process to reading text and then using the knowledge to write something else. That’s training and use.
Well, AI can perhaps solve the problem it created here: generated IP with AI is much cheaper than with humans, so it will be viable even at lower payoffs.
Less cynically: you can use trade secrets to protect your IP. You can host your software and only let customers interact with it remotely, like what Google (mostly) does.
Of course, this is a very software-centric view. You can't 'protect' e.g. books or music in this way.
Publishing pirated IP without any monetary gain to yourself also used to be treated more leniently.
Of course, all the rules were changed (both in law and in interpretation in practice) as file sharing became a huge deal about two decades ago.
Details depend on jurisdiction.
I don't want my children to pay a license fee to their school or their textbook publishers for what they learn in school.
It's easier for the LLM to rewrite an idiomatic computational geometry algorithm from scratch in a language it understands well like Python. Entire computational geometry textbooks and research papers are in its knowledge base. It doesn't have to copy some proprietary implementation.
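For illustration, here's a routine of roughly that flavor, derivable from any textbook: the 2D orientation test, with its collinear degenerate case handled explicitly. This is my own sketch, not from any repository.

    def orientation(p, q, r):
        """Return 1 for counter-clockwise, -1 for clockwise, 0 for collinear."""
        cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
        if cross > 0:
            return 1
        if cross < 0:
            return -1
        return 0  # degenerate case: the three points are collinear

    # The degenerate cases are exactly what a decent test suite checks.
    assert orientation((0, 0), (1, 0), (1, 1)) == 1
    assert orientation((0, 0), (1, 1), (1, 0)) == -1
    assert orientation((0, 0), (1, 1), (2, 2)) == 0  # collinear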
There is an entire research field of scientific discovery using LLMs, together with sub-disciplines for the various specializations. LLMs routinely discover new things.
Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.
However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.
Just like memorizing a book isn't infringement but reproducing a book from memory is.
(I find the example of the computational geometry algorithm being a clear case of direct memorization not very compelling, in any case.)
The amount of IP risk caused by USING (not training) AI models to produce code, especially wholesale commercial code that competes with code that was contained in the training data, is poorly understood.
So, for the specific case of material contributed to StackOverflow on or after 2018-05-02, it's possible to use it under GPLv3 (including appropriate attribution), so any project compatible with GPLv3 can copy it with attribution. Any material before that point is not safe to copy.
It's potentially non-infringing in a book if you quote it in a plausible way, and do it properly.
If you copy&paste a paragraph from another book into yours, it's infringing, and a career-ending scandal. There's plenty of precedent on that.
Just like if you manually copied a function out of some GPL code and pasted it into your own.
Or if you had an LLM do it for you.
It is strange that you think the law is settled when I don't think even this "societal desire" is completely settled just yet.
Seems to me the training of AI is not radically different from compression algorithms building up a dictionary and compressing data.
Yet nobody calls JPEG compression “transformative”.
Could one do lossy compression over billions of copyrighted images to “train” a dictionary?
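You can do a small-scale version of that today. A toy sketch using zlib's preset dictionaries (lossless and tiny, but the same shape of idea; real dictionary training, as in Zstandard's `zstd --train`, is smarter):

    import zlib

    # "Train" a preset dictionary from existing documents. zlib just uses the
    # raw bytes as a back-reference window (max 32 KiB).
    corpus = b"for i in range(n): total += values[i]\n" * 100
    dictionary = corpus[-32768:]

    def compress(data: bytes) -> bytes:
        c = zlib.compressobj(zdict=dictionary)
        return c.compress(data) + c.flush()

    def decompress(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=dictionary)
        return d.decompress(blob)

    msg = b"for i in range(n): total += values[i]\n"
    packed = compress(msg)
    assert decompress(packed) == msg
    # `packed` can be tiny because it mostly *references* the dictionary:
    # the information lives in the shared dictionary, not in the output.
    print(len(msg), "->", len(packed))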
This doesn't seem like a disputable statement to me. The view that actors' likenesses, authors' words, all of it, should be up for grabs once written or put anywhere in public is not a widely held opinion.
Once that’s established, it all comes down to implementation details.
If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (quoted as it could be more than just a model, like vector or search databases or some other orchestration across multiple workloads).
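To be concrete about what that orchestration could look like, here is a hypothetical retrieval sketch; every name, shape, and value below is invented for illustration, not a claim about MS's actual system. Embed the editor context, nearest-neighbor it against stored chunks of public-repo text, splice the winner into the suggestion.

    import numpy as np

    # Hypothetical chunk store: verbatim text from public repos plus
    # precomputed embeddings (random stand-ins here).
    chunk_texts = ["def clamp(x, lo, hi): ...", "class Quaternion: ..."]
    chunk_vecs = np.random.rand(len(chunk_texts), 384)

    def retrieve(query_vec, k=1):
        # Cosine similarity against every stored chunk; return the top k.
        sims = (chunk_vecs @ query_vec) / (
            np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
        return [chunk_texts[i] for i in np.argsort(sims)[::-1][:k]]

    # A completion service built this way would be suggesting verbatim
    # source text retrieved at request time, not text "in" model weights.
    print(retrieve(np.random.rand(384)))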
If you find a human that did that send them my way, I'll hire them.
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
(not legal advice)
Transformative works are necessarily derivative, but that transformation allows for a legal claim to "fair use" regardless of making a derived work.
LLMs are interesting because they can combine things they learn from multiple projects into a new language that doesn't feature in any of them, and pick up details from your request.
Unless you're schizophrenic enough to insist that you never even see other code, it's just not a realistic problem.
Honestly, I've had big arguments about this IP stuff before, and unless you actually have a lawyer specifically go after something, or very obviously violate the GPL, it's just a tactic for people to slow down people they don't like. People find a way to invent HR departments fractally.
Coders don't get paid every single time their code runs. Why bundle different rights together?
Like this?
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)
They do if they code the API correctly.
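i.e. meter the calls. A hypothetical sketch of pay-per-run, with names and pricing invented for illustration:

    import functools, time

    usage_log = []  # hypothetical billing ledger

    def metered(price_per_call):
        """Record every invocation so the author can bill for it."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                usage_log.append((fn.__name__, time.time(), price_per_call))
                return fn(*args, **kwargs)
            return inner
        return wrap

    @metered(price_per_call=0.001)  # the author earns $0.001 per run
    def analyze(data):
        return sum(data)

    analyze([1, 2, 3])
    print(f"owed: ${sum(p for _, _, p in usage_log):.3f}")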
> Why bundle different rights together?
Why are mineral rights sold separately from most land deeds?
Excerpt from the user agreement:
When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. For example, this license includes the right to use Your Content to train AI and machine learning models, as further described in our Public Content Policy. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.
People put their heads in the sand over Reddit for some reason, but it's worse than FAANG. From the Y Combinator terms: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
LLMs do not have an internal model for manipulating mathematical objects. They cannot, by design, come up with new algorithms unless they are very nearly the same as some other algorithm. I'm a computer science researcher and have not heard of a single algorithm created by an LLM.
An LLM is looking at the shape of words and ideas at scale and using that to provide answers.
How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
If somebody writes a GPL-licensed program, is it enough to change all variable and function names to get rid of those pesky users' rights? Do you have to change the order of functions? Do you have to convert it to a different language? Surely nobody would claim c2rust is transformative even though the resulting code can be wildly different if you apply enough mechanical transformations.
All LLMs do is make the mechanical transformations 1) probabilistic, 2) opaque, 3) applied all at once, and 4) sourced from multiple projects.
It really is that simple.
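As a toy illustration of how mechanical the rename transformation above can be (Python 3.9+ for ast.unparse; the snippet itself is invented):

    import ast
    import builtins
    import textwrap

    # A toy "license launderer": mechanically rename every argument and local
    # variable. The text changes; the work is plainly the same work.
    source = textwrap.dedent("""
        def clamp(value, low, high):
            result = min(max(value, low), high)
            return result
    """)

    class Renamer(ast.NodeTransformer):
        def __init__(self):
            self.aliases = {}

        def _alias(self, name):
            return self.aliases.setdefault(name, f"v{len(self.aliases)}")

        def visit_Name(self, node):
            if node.id not in dir(builtins):  # leave min/max etc. alone
                node.id = self._alias(node.id)
            return node

        def visit_arg(self, node):
            node.arg = self._alias(node.arg)
            return node

    tree = Renamer().visit(ast.parse(source))
    print(ast.unparse(tree))
    # def clamp(v0, v1, v2):
    #     v3 = min(max(v0, v1), v2)
    #     return v3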
Forcing something on people from a position of power is never in their favor.
I don't see why a company which has been waging a multi-decade war against the GPL and users' rights would stop at _public_ repositories.
Advertising autocomplete as AI was a genius move because people start humanizing it and look for human-centric patterns.
Thinking A"I" can do anything on its own is like seeing faces in rocks on Mars.
The only thing it suggests is that they recognize that a subset of users worry about it. Whether or not GitHub worries about it any further isn’t suggested.
Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.
Because the population does not rebel against the politicians that made these laws.
The only difference, really, is that we know how the JPEG algorithm works. If I wanted to, I could painstakingly make a JPEG by hand. We don't know how LLMs work.
Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.
And it's not an accident that significant percentage (40%?) of all papers being published in top journals involve application of AIs.
Legally speaking, this varies from domain to domain. But consider, for example, extracting facts from several biology textbooks and then delivering those facts to the user in the characteristic ChatGPT tone, distinguishable from the style of each source textbook. You can then be quite assured that courts will not find that you have infringed on copyright.
As a user of Reddit, I think it’s cool, and also raises some concerns.
I think most sites that handle user data are going to have rough edges. Making money off of user content is never without issues.
The AI coming up with it? When Google claimed their Wizard of Oz show at the Las Vegas Sphere was AI-generated, a ton of VFX artists spoke up to say they'd spent months of human labor working on it. Forgive me for not giving the benefit of the doubt to a company that has a vested interest in making their AI seem more powerful, and a track record of lying to do so.
The nature of network effects is such that once a site gets as big as reddit (or facebook or tiktok or whichever), it's nearly impossible for competition to take over in the same design space.
Many communities (both small and large) are only present on specific platforms (sometimes only one) and if you want to participate you have to accept their terms or exclude yourself socially.
Most communities on Reddit that I’d care to be a part of have additional places to gather, but I do take your point that there are few good alternatives to r/jailbreak, for example.
The host always sets its own rules. How else could anything actually get done? The coordination problem is hard enough as it is. It’s a wonder that social media exists at all.
Gatekeepers will always exist adjacent to the point of entry, otherwise every site turns extremist and becomes overrun with scammers and spammers.
I comprehend it just fine, I was adding context for those who may not comprehend.