Sure, it’s a big hill to climb: rethinking IP law to align with the societal desire that generating IP remain a viable economic work product. But that is what’s necessary.
This is far from settled law. Let's not mischaracterize it.
Even so, an AI regurgitating proprietary code that's licensed in some other way is a very real risk.
So: yes, technically possible. But impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively matrices of weights that allow math to be done to predict the most likely next token (read: 2-3 Unicode characters) given some input.
So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
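To make the "matrices doing math" point concrete, here's a toy sketch. Every name and number below is made up for illustration; it is not any real model, just the shape of the computation:

    # Toy sketch: next-token prediction as matrix math, nothing more.
    import numpy as np

    vocab = ["def", "foo", "(", ")", ":", "return"]
    hidden = np.random.randn(8)          # state summarizing the input so far
    W = np.random.randn(len(vocab), 8)   # learned weight matrix

    logits = W @ hidden                               # one score per token
    probs = np.exp(logits) / np.exp(logits).sum()     # softmax -> probabilities
    print(vocab[int(np.argmax(probs))])               # most likely next token

There is no stored corpus anywhere in that picture, only weights that score candidate next tokens.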
Finally, let's grant that it's possible for the model to spit out not entire works, but a handful of lines of code that appear in some codebase.
This does not constitute copyright infringement, as the lines in question a) represent a tiny portion of the whole work (and copyright only protects against the reproduction of whole works or significant portions of a work), and b) there are a limited number of ways to accomplish a certain function, and it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation of a part of a work (which is what this case would be) is no more illegal than the use of a certain chord progression or melodic phrasing or drum rhythm. Courts have ruled on this thoroughly.
If you have code that happens to be identical to someone else's code or implements someone's proprietary algorithm, you're going to lose in court even if you claim an "AI" gave it to you.
AI is training on private GitHub repos and coughing them up. I've had it regurgitate a very well-written piece of code for a particular computational geometry algorithm. It presented perfect, idiomatic Python with perfect tests that caught all the degenerate cases. That was obviously proprietary code: no amount of searching came up with anything even remotely close (it's why I asked the AI, after all).
Not for a dozen lines here or there, even if they could be found and identified in a massive code base. That’s like quoting a paragraph of a book in another book: non-infringing.
For the second half of your comment, it sounds like you’re saying you got results that were too good to be AI. That’s a bit “no true Scotsman”, at least without more detail. But implementing an algorithm, even a complex one, is very much something an LLM can do. Algorithms are much better defined and scoped than general natural language, and LLMs do a reasonable job of translating natural language into programming languages. An algorithm is a narrow subset of that task type, with better-defined context and syntax.
Judge Alsup, in his ruling, specifically likened the process to reading text and then using the knowledge to write something else. That’s training and use.
Well, AI can perhaps solve the problem it created here: generating IP with AI is much cheaper than with humans, so it will be viable even at lower payoffs.
Less cynical: you can use trade secrets to protect your IP. You can host your software and only let customers interact with it remotely, like what Google (mostly) does.
Of course, this is a very software-centric view. You can't 'protect' e.g. books or music in this way.
Publishing pirated IP without any monetary gain to yourself also used to be treated more leniently.
Of course, all the rules were changed (both in law and in interpretation in practice) as file sharing became a huge deal about two decades ago.
Details depend on jurisdiction.
It's easier for the LLM to rewrite an idiomatic computational geometry algorithm from scratch in a language it understands well like Python. Entire computational geometry textbooks and research papers are in its knowledge base. It doesn't have to copy some proprietary implementation.
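For illustration, here's the kind of building block and degenerate-case test being talked about. This is the standard textbook orientation predicate, found in every computational geometry course, not anyone's proprietary code:

    # The classic orientation predicate, with its degenerate (collinear) case.
    def orientation(p, q, r):
        """>0 if p->q->r turns left, <0 if right, 0 if collinear."""
        return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

    assert orientation((0, 0), (1, 0), (1, 1)) > 0   # left turn
    assert orientation((0, 0), (1, 0), (1, -1)) < 0  # right turn
    assert orientation((0, 0), (1, 1), (2, 2)) == 0  # collinear: the degenerate case

Two competent devs (or a model trained on the textbooks) will converge on essentially this exact code.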
There is an entire research field of scientific discovery using LLMs, together with sub-disciplines for the various specializations. LLMs routinely discover new things.
Yes, the training of the model itself is (or should be) a transformative act so you can train a model on whatever you have legal access to view.
However, that doesn't mean that the output of the model is automatically not infringing. If the model is prompted to create a copy of some copyrighted work, that is (or should be) still a violation.
Just like memorizing a book isn't infringement but reproducing a book from memory is.
(I find the example of the computational geometry algorithm being a clear case of direct memorization not very compelling, in any case.)
It's potentially non-infringing to quote a paragraph in a book, if you quote it in a plausible way and attribute it properly.
If you copy&paste a paragraph from another book into yours, it's infringing, and a career-ending scandal. There's plenty of precedent on that.
Just like if you manually copied a function out of some GPL code and pasted it into your own.
Or if you had an LLM do it for you.
It is strange that you think the law is settled when I don't think even this "societal desire" is completely settled just yet.
Seems to me the training of AI is not radically different from compression algorithms building up a dictionary and compressing data.
Yet nobody calls JPEG compression “transformative”.
Could one do lossy compression over billions of copyrighted images to “train” a dictionary?
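As a hypothetical sketch of what that would even mean: "train" a codebook on patches pooled from many images, then compress any image down to nearest-codebook-entry indices. Toy numpy, with random data standing in for real image patches:

    # Hypothetical: a learned "dictionary" used for lossy compression.
    import numpy as np

    rng = np.random.default_rng(0)
    patches = rng.random((2000, 64))     # stand-in for 8x8 patches from many images

    # plain k-means: the learned codebook, loosely analogous to model weights
    codebook = patches[rng.choice(len(patches), 64, replace=False)]
    for _ in range(10):
        nearest = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(len(codebook)):
            members = patches[nearest == k]
            if len(members):
                codebook[k] = members.mean(0)

    # "compress": each 64-float patch becomes one small index into the dictionary
    indices = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    reconstruction = codebook[indices]   # lossy: only what the dictionary retained

The codebook itself stores no single source image, yet it was built entirely from them; that's exactly the tension in the analogy.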
This doesn’t seem like a disputable statement to me. The idea that actors’ likenesses, authors’ words, all of it, should be up for grabs once written down or put anywhere in public is not a widely held opinion.
Once that’s established, it all comes down to implementation details.
If MS were compelled to reveal how these completions are generated, there’s at least a possibility that they directly use public repositories to source text chunks that their “model” suggested were relevant (“model” in quotes because it could be more than just a model: vector or search databases, or some other orchestration across multiple workloads).
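A hypothetical sketch of what such an orchestration could look like. To be clear, nothing here is known to be how Copilot actually works; it just shows that "suggestion sourced from an index of public-repo chunks" is a plausible architecture:

    # Hypothetical retrieval-based completion: the suggestion comes from an
    # index of repo chunks, not from model weights alone.
    import numpy as np

    def embed(text):                 # stand-in for a real embedding model
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.random(128)
        return v / np.linalg.norm(v)

    repo_chunks = ["def quicksort(xs): ...", "class BTree: ...", "def convex_hull(pts): ..."]
    index = np.stack([embed(c) for c in repo_chunks])   # the "vector database"

    query = embed("sort a list quickly")
    best = int((index @ query).argmax())                # nearest-neighbor lookup
    print(repo_chunks[best])   # chunk handed to (or returned instead of) the model

If anything like this is in the pipeline, verbatim source text can surface without the model "memorizing" it at all.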
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
(not legal advice)
Transformative works are necessarily derivative, but that transformation allows for a legal claim of "fair use" despite the work being derived.
Coders don't get paid every single time their code runs. Why bundle different rights together?
Like this?
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)
They do if they code the API correctly.
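A minimal sketch of what "coding the API correctly" means here, i.e. metering each invocation. All names below are made up for illustration:

    # Meter the API and you do get paid per run.
    from collections import Counter

    PRICE_PER_CALL = 0.001        # dollars
    usage = Counter()             # calls per API key

    def do_work(payload):
        return payload.upper()    # stand-in for the actual computation

    def handle_request(api_key, payload):
        usage[api_key] += 1       # bill this invocation
        return do_work(payload)

    handle_request("customer-42", "hello")
    print(usage["customer-42"] * PRICE_PER_CALL)   # amount owed so far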
> Why bundle different rights together?
Why are mineral rights sold separately from most land deeds?
Excerpt from the user agreement:
When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. For example, this license includes the right to use Your Content to train AI and machine learning models, as further described in our Public Content Policy. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.
People put their heads in the sand over Reddit for some reason, but it's worse than FAANG. With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.

LLMs do not have an internal model for manipulating mathematical objects. They cannot, by design, come up with new algorithms unless they are very nearly the same as some other algorithm. I'm a computer science researcher and have not heard of a single algorithm created by an LLM.
An LLM is looking at the shape of words and ideas at scale and using that to provide answers.
How complex does a mechanical transformation have to be to not be considered plagiarism, copyright infringement or parasitism?
If somebody writes a GPL-licensed program, is it enough to change all variable and function names to get rid of those pesky users' rights? Do you have to change the order of functions? Do you have to convert it to a different language? Surely nobody would claim c2rust is transformative even though the resulting code can be wildly different if you apply enough mechanical transformations.
All LLMs do is make the mechanical transformations 1) probabilistic 2) opaque 3) all at once 4) using multiple projects as a source.
It really is that simple.
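To make the first of those questions concrete, here is the most trivial mechanical transformation possible, a blanket identifier rename. The program is unchanged in any meaningful sense, which is the point:

    # A purely mechanical transformation: rename every identifier.
    import ast

    class Rename(ast.NodeTransformer):
        def visit_Name(self, node):
            node.id = "v_" + node.id   # blanket rename of every identifier
            return node

    src = "total = price * count"
    tree = Rename().visit(ast.parse(src))
    print(ast.unparse(tree))           # v_total = v_price * v_count

Nobody would argue the output is a new, non-derivative work; the question is only how many such steps it takes before people start arguing it.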
Forcing something on people from a position of power is never in their favor.
I don't see why a company which has been waging a multi-decade war against the GPL and users' rights would stop at _public_ repositories.
Advertising autocomplete as AI was a genius move because people start humanizing it and look for human-centric patterns.
Thinking A"I" can do anything on its own is like seeing faces in rocks on Mars.
The only thing it suggests is that they recognize that a subset of users worry about it. Whether or not GitHub worries about it any further isn’t suggested.
Don’t think about it from an actual “rights” perspective. Think about the entire copyright issue as a “too big to fail” issue.
Because the population does not rebel against the politicians that made these laws.
The only difference, really, is that we know how the JPEG algorithm works. If I wanted to, I could painstakingly make a JPEG by hand. We don't know how LLMs work.
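And that hand-made JPEG really is feasible, because every step is a published recipe. A toy version of the core of it, an 8x8 DCT followed by quantization, with a flat quantization table standing in for the real tables:

    # The heart of JPEG: 8x8 DCT, then (lossy) quantization. Every step is known.
    import numpy as np

    n = 8
    k = np.arange(n)
    C = np.sqrt(2 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2)                  # orthonormal DCT-II basis

    block = np.arange(64).reshape(8, 8).astype(float)   # one 8x8 pixel block
    coeffs = C @ block @ C.T            # 2-D DCT
    quantized = np.round(coeffs / 16)   # the lossy step (flat quantization table)
    restored = C.T @ (quantized * 16) @ C   # decoder: fully specified, no mystery

There is no equivalent fully specified recipe for what happens inside an LLM's weights.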
Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.
And it's not an accident that significant percentage (40%?) of all papers being published in top journals involve application of AIs.
Legally speaking, this varies from domain to domain. But consider, for example, extracting facts from several biology textbooks, and then delivering those facts to the user in the characteristic ChatGPT tone, distinguishable from the style of each source textbook. You can be quite assured that courts will not find that you have infringed copyright.
As a user of Reddit, I think it’s cool, and also raises some concerns.
I think most sites that handle user data are going to have rough edges. Making money off of user content is never without issues.
The AI coming up with it? When Google claimed their Wizard of Oz show at the Las Vegas Sphere was AI-generated, a ton of VFX artists spoke up to say they'd spent months of human labor working on it. Forgive me for not giving the benefit of the doubt to a company that has a vested interest in making their AI seem more powerful, and a track record of lying to do so.
The nature of network effects is such that once a site gets as big as reddit (or facebook or tiktok or whichever), it's nearly impossible for competition to take over in the same design space.
Many communities (both small and large) are only present on specific platforms (sometimes only one) and if you want to participate you have to accept their terms or exclude yourself socially.
Most communities on Reddit that I’d care to be a part of have additional places to gather, but I do take your point that there are few good alternatives to r/jailbreak, for example.
The host always sets its own rules. How else could anything actually get done? The coordination problem is hard enough as it is. It’s a wonder that social media exists at all.
Gatekeepers will always exist adjacent to the point of entry, otherwise every site turns extremist and becomes overrun with scammers and spammers.
I comprehend it just fine, I was adding context for those who may not comprehend.