AI tooling must be disclosed for contributions

>>luma+z6
"see" and "copy" are two different things. It's fine to look at StackOverflow to understand the solution to a problem. It's not fine to copy and paste from StackOverflow and ignore its license or attribution.

Content on StackOverflow is under CC-by-sa, version depends on the date it was submitted: https://stackoverflow.com/help/licensing . (It's really unfortunate that they didn't pick a license compatible with code; at one point they started to move to the MIT license for code, but then didn't follow through on it.)

>>bgwalt+F8
> I still do not understand

Your question makes sense. See U.S. Copyright Office publication:

> If a work's traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it.

> For example, when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user...

> For example, if a user instructs a text-generating technology to “write a poem about copyright law in the style of William Shakespeare,” she can expect the system to generate text that is recognizable as a poem, mentions copyright, and resembles Shakespeare's style. But the technology will decide the rhyming pattern, the words in each line, and the structure of the text.

> When an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship. As a result, that material is not protected by copyright and must be disclaimed in a registration application.

> In other cases, however, a work containing AI-generated material will also contain sufficient human authorship to support a copyright claim. For example, a human may select or arrange AI-generated material in a sufficiently creative way that “the resulting work as a whole constitutes an original work of authorship.”

> Or an artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection. In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.

> This policy does not mean that technological tools cannot be part of the creative process. Authors have long used such tools to create their works or to recast, transform, or adapt their expressive authorship. For example, a visual artist who uses Adobe Photoshop to edit an image remains the author of the modified image, and a musical artist may use effects such as guitar pedals when creating a sound recording. In each case, what matters is the extent to which the human had creative control over the work's expression and “actually formed” the traditional elements of authorship.

> https://www.federalregister.gov/documents/2023/03/16/2023-05...

In any but a pathological case, a real code contribution to a real project has sufficient human authorship to be copyrightable.

> the license of the project lies about parts of the code

That was a concern pre-AI too! E.g. copy-paste from StackOverflow. Projects require contributors to sign CLAs, which doesn't guarantee compliance but strengthens the legal position. Usually something like:

"You represent that your contribution is either your original creation or you have sufficient rights to submit it."

>>freeto+(OP)
Slightly off-topic: does anyone remember mitchellh's setup for working with AI tools? I remember someone posted it in one of the AI love-hate threads here, and it's not on his blog[1]

1: https://mitchellh.com/writing

>>KritVu+kc
That’s not what I said though. LLM output, even unreviewed and without understanding, can be a useful artifact. I do it all the time - generate code, try running it, and then if I see it works well, I can decide to review it and follow up with necessary refactoring before integrating it. Parts of that can be contributed too. We’re just learning new etiquettes for doing that productively, and that does include testing the PR btw (even if the code itself is not understood or reviewed).

Example where this kind of contribution was accepted and valuable, inside this ghostty project https://x.com/mitchellh/status/1957930725996654718

>>oceanp+na
> Imagine living before the invention of the printing press, and then lamenting that we should ban them because it makes it "too easy" to distribute information

Imagine seeing “rm -rf / is a function that returns ‘Hello World!’” and thinking “this is the same thing as the printing press”

https://bsky.app/profile/lookitup.baby/post/3lu2bpbupqc2f

>>Waterl+A3
It's not just about how you got there. At least in the United States according to the Copyright Office... materials produced by artificial intelligence are not eligible for copyright. So, yeah, some people want to know for licensing purposes. I don't think that's the case here, but it is yet another reason to require that kind of disclosure... since if you fail to mention that something was made by AI as part of a compound work you could end up losing copyright over the whole thing. For more details, see [2] (which is part of the larger report on Copyright and AI at [1]).

--

[1] https://www.copyright.gov/ai/

[2] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

>>JoshTr+gb
CC BY-SA 4.0 is "compatible with code". It is, for example, GPL-compatible (see https://wiki.creativecommons.org/wiki/ShareAlike_compatibili...). It's just not designed for code.

>>mglvsk+jc
maybe one of these? https://x.com/mitchellh/status/1952905654458564932 https://www.youtube.com/watch?v=XyQ4ZTS5dGw

>>jerf+rc
I have had similar observations to you and tried to capture them here: https://www.linkedin.com/posts/alex-buie-35b488158_ai-neuros...

The gist being - language (text input) is actually the vehicle you have to transfer neural state to the engine. When you are working in a greenfield project or pure-vibe project, you can get away with most of that neural state being in the "default" probability mode. But in a legacy project, you need significantly more context to constrain the probability distributions a lot closer to the decisions which were made historically.

>>philjo+o9
I've been doing that for most of the commits in this project as an experiment, gemini, human, or both. Not sure what I'm going to do with that history, but I did at least want to start capturing it

https://github.com/blebbit/at-mirror/commits/main/
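For anyone wondering what could be done with that kind of history later, here is a minimal sketch of tallying commits by provenance. It assumes each commit message carries a trailer line; the `Generated-by` trailer name and the sample messages are hypothetical and are not taken from the linked repo, which may record this differently.

```python
from collections import Counter

# Hypothetical trailer name; the linked repo may label provenance differently.
TRAILER = "Generated-by"


def provenance(message: str) -> str:
    """Return the provenance label found in a commit message, or 'unlabeled'."""
    # Trailers sit at the end of a message, so scan from the bottom up.
    for line in reversed(message.strip().splitlines()):
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == TRAILER.lower():
            return value.strip().lower()
    return "unlabeled"


def tally(messages: list[str]) -> Counter:
    """Count commits per provenance label."""
    return Counter(provenance(m) for m in messages)


# Invented sample messages, for illustration only.
commits = [
    "add firehose reader\n\nGenerated-by: gemini",
    "fix cursor handling\n\nGenerated-by: human",
    "refactor backfill\n\nGenerated-by: both",
    "bump deps",
]
counts = tally(commits)
```

In practice the messages would come from something like `git log --format=%B`, and the unlabeled bucket shows how much of the history predates the experiment.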

>>freeto+(OP)
Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Doe v. GitHub (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.

>>eru+AP
In the US you cannot generate copyrightable IP without substantial human contribution to the process.

https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

>>BobbyT+aX
> I’m curious … So “transformative” is not necessarily “derivative”?

(not legal advice)

Transformative works are necessarily derivative, but that transformation allows for a legal claim to "fair use" regardless of making a derived work.

https://en.wikipedia.org/wiki/Transformative_use

>>raggi+gZ
Tell that to Reddit. They’re AI translating user posts and serving it up as separate Google search results. I don’t remember if Reddit claims copyright on user-submitted content, or on its AI translations, but I don’t think Reddit is paying ad share like X is, either, so it kind of doesn’t matter to the user, as they’re (still) not getting paid, even as Reddit collects money for every ad shown/clicked. Even if OP did write it, an AI translated the version shown.

https://news.ycombinator.com/context?id=44972296

>>j4coh+I01
"an LLM" could imply an LLM of any size; for sufficiently small or focused training sets, an LLM may not be transformative. There is some scale at which the volume and diversity of training data and the intricacy of abstraction move away from something you could reasonably consider solely memorization - there's a separate issue of reproduction though.

"novel" here depends on what you mean. Could an LLM produce output that is unique, that neither it nor anyone else has seen before? Possibly, yes. Could that output have perceived or emotional value to people? Sure. Related challenge: Is a random encryption key generated by a csprng novel?

In the case of the US copyright office, if there wasn't sufficient human involvement in the production then the output is not copyrightable and how "novel" it is does not matter - but that doesn't necessarily impact a prior production by a human that is (whether a copy or not). Novel also only matters in a subset of the many fractured areas of copyright laws affecting the space of this form of digital replication. The copyright office wrote: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell....

Where I imagine this approximately ends up is a set of tests oriented around how relevant the "copy" is to the whole. That is, it may not matter whether the method of production involved "copying"; it may matter more whether the whole work in which it is included is, at large, a copy. Or, if the contested portion could be replaced with something novel and is a small enough piece of the whole, it may not meet some bar of material value to the whole to be relevant, meaning there is no harmful infringement, or it could similarly fall under some notion of fair use.

I don't see much sanity in a world where small snippets become an issue. I think if models were regularly producing thousands of tokens of exactly duplicate content that's probably an issue.

I've not seen evidence of the latter outside of research that very deliberately performs active search for high probability cases (such as building suffix tree indices over training sets then searching for outputs based on guidance from the index). That's very different from arbitrary work prompts doing the same, and the models have various defensive trainings and wrappings attempting to further minimize reproductive behavior. On the one hand you have research metrics like 3.6 bits per parameter of recoverable input, on the other hand that represents a very small slice of the training set, and many such reproductions requiring strongly crafted and long prompts - meaning that for arbitrary real world interaction the chance of large scale overlap is small.
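To make the "thousands of tokens of exactly duplicate content" test concrete, here is a rough sketch of measuring the longest verbatim token run shared between a model output and a reference text. It uses a plain dynamic-programming longest-common-substring over tokens, not the suffix-tree indices the research uses, and the whitespace tokenization and the `threshold` default are my own assumptions for illustration.

```python
def longest_shared_run(output: list[str], reference: list[str]) -> int:
    """Length of the longest contiguous token run present in both sequences."""
    best = 0
    # prev[j] = length of the shared run ending at output[i-2] and reference[j-1]
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(output) + 1):
        cur = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            if output[i - 1] == reference[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def looks_verbatim(output: list[str], reference: list[str], threshold: int = 1000) -> bool:
    """Flag outputs whose shared run reaches a threshold (an assumed value, not a legal bar)."""
    return longest_shared_run(output, reference) >= threshold


# Toy data, split on whitespace for illustration only.
reference = "the quick brown fox jumps over the lazy dog".split()
output = "a quick brown fox sat down".split()
run = longest_shared_run(output, reference)
```

This is quadratic in the two lengths, so it only works for comparing one output against one document; searching a whole training set is exactly why the research builds an index first.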

>>JoshTr+3V
It does help with prior versions, since you can use CC BY-SA 3.0 material under CC BY-SA 4.0, which is GPLv3-compatible. (See https://meta.stackexchange.com/a/337742/308065.) It doesn't necessarily help with future versions.

>>popalc+1i
> No more so than regurgitating an entire book.

Like this?

Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)

>>kentm+T61
The ghostty creator disagrees re: the productivity of un-reviewed generated PRs: https://x.com/mitchellh/status/1957930725996654718

>>freedo+i5
The line isn’t as clear as you might think, e.g. JetBrains has a mini on-device neural-net-powered autocomplete:

https://www.jetbrains.com/help/idea/full-line-code-completio...

>>victor+V5
You might be interested in reading Part 2 of the US Copyright Office's report on Copyright and Artificial Intelligence: <https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...>

>>eru+hA1
It's already been done.

https://en.wikipedia.org/wiki/Generative_artificial_intellig...

>>smitop+FM
This is what you get for skimming. :D

Just to be sure that I wasn't misremembering, I went through part 2 of the report and back to the original memorandum[1] that was sent out before the full report issued. I've included a few choice quotes to illustrate my point:

"These are no longer hypothetical questions, as the Office is already receiving and examining applications for registration that claim copyright in AI-generated material. For example, in 2018 the Office received an application for a visual work that the applicant described as “autonomously created by a computer algorithm running on a machine.” 7 The application was denied because, based on the applicant’s representations in the application, the examiner found that the work contained no human authorship. After a series of administrative appeals, the Office’s Review Board issued a final determination affirming that the work could not be registered because it was made “without any creative contribution from a human actor.”"

"More recently, the Office reviewed a registration for a work containing human-authored elements combined with AI-generated images. In February 2023, the Office concluded that a graphic novel comprised of human-authored text combined with images generated by the AI service Midjourney constituted a copyrightable work, but that the individual images themselves could not be protected by copyright. "

"In the Office’s view, it is well-established that copyright can protect only material that is the product of human creativity. Most fundamentally, the term “author,” which is used in both the Constitution and the Copyright Act, excludes non-humans."

"In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.” The answer will depend on the circumstances, particularly how the AI tool operates and how it was used to create the final work. This is necessarily a case-by-case inquiry."

"If a work’s traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it."[1], pgs 2-4

---

On the odd chance that somehow the Copyright Office had reversed itself I then went back to part 2 of the report:

"As the Office affirmed in the Guidance, copyright protection in the United States requires human authorship. This foundational principle is based on the Copyright Clause in the Constitution and the language of the Copyright Act as interpreted by the courts. The Copyright Clause grants Congress the authority to “secur[e] for limited times to authors . . . the exclusive right to their . . . writings.” As the Supreme Court has explained, “the author [of a copyrighted work] is . . . the person who translates an idea into a fixed, tangible expression entitled to copyright protection.”"

"No court has recognized copyright in material created by non-humans, and those that have spoken on this issue have rejected the possibility. "

"In most cases, however, humans will be involved in the creation process, and the work will be copyrightable to the extent that their contributions qualify as authorship." -- [2], pgs 15-16

---

TL;DR If you make something with the assistance of AI, you still have to be personally involved and contribute more than just a prompt in order to receive copyright, and then you will receive protection only over such elements of originality and authorship that you are responsible for, not those elements which the AI is responsible for.

---

[1] https://copyright.gov/ai/ai_policy_guidance.pdf

[2] https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...

>>rowanG+AF
An example: https://medium.com/@deshmukhpratik931/the-matrix-multiplicat...

Obviously not ChatGPT. But ChatGPT isn't the sharpest stick on the block by a significant margin. It is a mistake to judge what AIs can do based on what ChatGPT does.

>>jakela+cP
https://medium.com/@deshmukhpratik931/the-matrix-multiplicat...

And it's not an accident that a significant percentage (40%?) of all papers being published in top journals involve the application of AIs.

>>freeto+(OP)
Um, no.

Doesn't anyone see that this can't be policed or everyone becomes a criminal? That AI will bring the end of copyrights and patents as we know them when literally everything becomes a derivative work? When children produce better solutions than industry veterans so we punish them rather than questioning the divine right of corporations to rule over us? What happened to standing on the shoulders of giants?

I wonder if a lot of you are as humbled as I am by the arrival of AI. Whenever I use it, I'm in awe of what it comes up with when provided almost no context, coming in cold to something that I've been mulling over for hours. And it's only getting better. In 3-5 years, it will leave all of us behind. I'm saying that as someone who's done this for decades and has been down rabbit holes that most people have no frame of reference for.

My prediction is that like with everything, we'll bungle this. Those of you with big egos and large hoards of wealth that you thought you earned because you are clever will do everything you can to micromanage and sabotage what could have been the first viable path to liberation from labor in human history. Just like with the chilling effects of the Grand Upright Music ruling and the DMCA and HBO suing BitTorrent users (edit: I meant the MPAA and RIAA, HBO "only" got people's internet shut off), we have to remain eternally vigilant or the powers that be will take away all the fun:

https://en.wikipedia.org/wiki/Sampling_(music)

https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_A...

https://en.wikipedia.org/wiki/Legal_issues_with_BitTorrent#C...

So no, I won't even entertain the notion of demanding proof of origin for ideas. I'm not going down this slippery slope of suing every open source project that gives away its code for free, just because a PR put pieces together in a new way but someone thought of the same idea in private and thinks they're special.

>>jedbro+gW
This is how SQLite handles it:

> Contributed Code

> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.

Source: https://www.sqlite.org/copyright.html
