Content on StackOverflow is under CC BY-SA; the version depends on the date it was submitted: https://stackoverflow.com/help/licensing . (It's really unfortunate that they didn't pick a license compatible with code; at one point they started to move to the MIT license for code, but then didn't follow through on it.)
Your question makes sense. See the U.S. Copyright Office publication:
> If a work's traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it.
> For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user...
> For example, if a user instructs a text-generating technology to “write a poem about copyright law in the style of William Shakespeare,” she can expect the system to generate text that is recognizable as a poem, mentions copyright, and resembles Shakespeare's style. But the technology will decide the rhyming pattern, the words in each line, and the structure of the text.
> When an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship. As a result, that material is not protected by copyright and must be disclaimed in a registration application.
> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.”
> Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection. In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.
> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image, and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.
> https://www.federalregister.gov/documents/2023/03/16/2023-05...
In any but a pathological case, a real code contribution to a real project has sufficient human authorship to be copyrightable.
> the license of the project lies about parts of the code
That was a concern pre-AI too! E.g. copy-paste from StackOverflow. Projects require contributors to sign CLAs, which doesn't guarantee compliance but does strengthen the legal position. Usually something like:
"You represent that your contribution is either your original creation or you have sufficient rights to submit it."
Example where this kind of contribution was accepted and valuable, inside this ghostty project https://x.com/mitchellh/status/1957930725996654718
Imagine seeing “rm -rf / is a function that returns ‘Hello World!’” and thinking “this is the same thing as the printing press”
--
[1] https://www.copyright.gov/ai/
[2] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
The gist being - language (text input) is actually the vehicle you have to transfer neural state to the engine. When you are working in a greenfield project or pure-vibe project, you can get away with most of that neural state being in the "default" probability mode. But in a legacy project, you need significantly more context to constrain the probability distributions a lot closer to the decisions which were made historically.
We don't yet know how courts will rule on cases like Doe v. GitHub (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
(not legal advice)
Transformative works are necessarily derivative, but that transformation allows for a legal claim to "fair use" regardless of making a derived work.
"Novel" here depends on what you mean. Could an LLM produce unique output that neither it nor anyone else has seen before? Possibly, yes. Could that output have perceived or emotional value to people? Sure. Related challenge: is a random encryption key generated by a CSPRNG novel?
In the case of the US Copyright Office, if there wasn't sufficient human involvement in the production, then the output is not copyrightable and how "novel" it is does not matter. That doesn't necessarily affect a prior production by a human that is copyrightable, though (whether the output is a copy of it or not). Novelty also only matters in a subset of the many fractured areas of copyright law affecting this form of digital replication. The Copyright Office wrote: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell....
Where I imagine this approximately ends up is some set of tests oriented around how relevant the "copy" is to the whole. That is, it may not matter whether the method of production involved "copying"; it may matter more whether the whole work in which it is included is, at large, a copy. Or, if the contested area could be replaced with something novel and is a small enough piece of the whole, it may not meet some bar of material value to the whole to be relevant, so that there is no harmful infringement, or it could similarly cross into some notion of fair use.
I don't see much sanity in a world where small snippets become an issue. I think if models were regularly producing thousands of tokens of exactly duplicated content, that would probably be an issue.
I've not seen evidence of the latter outside of research that very deliberately performs active search for high probability cases (such as building suffix tree indices over training sets, then searching for outputs based on guidance from the index). That's very different from arbitrary work prompts doing the same, and the models have various defensive trainings and wrappings attempting to further minimize reproductive behavior. On the one hand you have research metrics like 3.6 bits per parameter of recoverable input; on the other hand, that represents a very small slice of the training set, and many such reproductions require strongly crafted and long prompts, meaning that for arbitrary real world interaction the chance of large scale overlap is small.
Like this?
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)
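For anyone who wants to check their own outputs for the kind of long verbatim overlap discussed above, here is a minimal Python sketch. It is not the suffix-tree method from the research mentioned earlier; it swaps in a much simpler set of token n-grams built from a reference corpus. The function names, the whitespace tokenizer, and the choice of `n` are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> list:
    # Assumption: naive whitespace tokenization, not a model tokenizer.
    return text.split()


def build_ngram_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every run of n consecutive tokens found in the reference corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def longest_overlap(output: str, index: Set[NGram], n: int) -> int:
    """Length, in tokens, of the longest run in `output` whose n-grams all
    appear in the index.

    This is an upper bound on true verbatim copying: consecutive n-grams
    may come from different reference documents.
    """
    tokens = tokenize(output)
    best = 0
    run = 0  # count of consecutive n-grams found in the index
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in index:
            run += 1
            best = max(best, run + n - 1)
        else:
            run = 0
    return best


if __name__ == "__main__":
    reference = ["the quick brown fox jumps over the lazy dog"]
    idx = build_ngram_index(reference, n=3)
    print(longest_overlap("a quick brown fox jumps high", idx, n=3))
```

A short run of a few tokens is the "small snippet" case; only runs in the hundreds or thousands of tokens would correspond to the large-scale duplication being argued about here.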
https://www.jetbrains.com/help/idea/full-line-code-completio...
Just to be sure that I wasn't misremembering, I went through part 2 of the report and back to the original memorandum[1] that was sent out before the full report was issued. I've included a few choice quotes to illustrate my point:
"These are no longer hypothetical questions, as the Office is already receiving and examining applications for registration that claim copyright in AI-generated material. For example, in 2018 the Office received an application for a visual work that the applicant described as “autonomously created by a computer algorithm running on a machine.” The application was denied because, based on the applicant’s representations in the application, the examiner found that the work contained no human authorship. After a series of administrative appeals, the Office’s Review Board issued a final determination affirming that the work could not be registered because it was made “without any creative contribution from a human actor.”"
"More recently, the Office reviewed a registration for a work containing human-authored elements combined with AI-generated images. In February 2023, the Office concluded that a graphic novel comprised of human-authored text combined with images generated by the AI service Midjourney constituted a copyrightable work, but that the individual images themselves could not be protected by copyright."
"In the Office’s view, it is well-established that copyright can protect only material that is the product of human creativity. Most fundamentally, the term “author,” which is used in both the Constitution and the Copyright Act, excludes non-humans."
"In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.” The answer will depend on the circumstances, particularly how the AI tool operates and how it was used to create the final work. This is necessarily a case-by-case inquiry."
"If a work’s traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it."[1], pgs 2-4
---
On the odd chance that somehow the Copyright Office had reversed itself, I then went back to part 2 of the report:
"As the Office affirmed in the Guidance, copyright protection in the United States requires human authorship. This foundational principle is based on the Copyright Clause in the Constitution and the language of the Copyright Act as interpreted by the courts. The Copyright Clause grants Congress the authority to “secur[e] for limited times to authors . . . the exclusive right to their . . . writings.” As the Supreme Court has explained, “the author [of a copyrighted work] is . . . the person who translates an idea into a fixed, tangible expression entitled to copyright protection.”"
"No court has recognized copyright in material created by non-humans, and those that have spoken on this issue have rejected the possibility."
"In most cases, however, humans will be involved in the creation process, and the work will be copyrightable to the extent that their contributions qualify as authorship." -- [2], pgs 15-16
---
TL;DR If you make something with the assistance of AI, you still have to be personally involved and contribute more than just a prompt in order to receive copyright, and then you will receive protection only over such elements of originality and authorship that you are responsible for, not those elements which the AI is responsible for.
---
[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.
And it's not an accident that a significant percentage (40%?) of all papers being published in top journals involve applications of AI.
Doesn't anyone see that this can't be policed or everyone becomes a criminal? That AI will bring the end of copyrights and patents as we know them when literally everything becomes a derivative work? When children produce better solutions than industry veterans so we punish them rather than questioning the divine right of corporations to rule over us? What happened to standing on the shoulders of giants?
I wonder if a lot of you are as humbled as I am by the arrival of AI. Whenever I use it, I'm in awe of what it comes up with when provided almost no context, coming in cold to something that I've been mulling over for hours. And it's only getting better. In 3-5 years, it will leave all of us behind. I'm saying that as someone who's done this for decades and has been down rabbit holes that most people have no frame of reference for.
My prediction is that like with everything, we'll bungle this. Those of you with big egos and large hoards of wealth that you thought you earned because you are clever will do everything you can to micromanage and sabotage what could have been the first viable path to liberation from labor in human history. Just like with the chilling effects of the Grand Upright Music ruling and the DMCA and HBO suing BitTorrent users (edit: I meant the MPAA and RIAA, HBO "only" got people's internet shut off), we have to remain eternally vigilant or the powers that be will take away all the fun:
https://en.wikipedia.org/wiki/Sampling_(music)
https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_A...
https://en.wikipedia.org/wiki/Legal_issues_with_BitTorrent#C...
So no, I won't even entertain the notion of demanding proof of origin for ideas. I'm not going down this slippery slope of suing every open source project that gives away its code for free, just because a PR put pieces together in a new way but someone thought of the same idea in private and thinks they're special.
> Contributed Code
> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.