zlacker

[parent] [thread] 36 comments
1. 6gvONx+(OP)[view] [source] 2023-07-15 15:13:30
> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

> 2) The nature of the copyrighted work

> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole

> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.

Regardless, it makes it seem much less clear cut than people here often say.

replies(5): >>flango+K >>civili+z1 >>amluto+T2 >>beware+a3 >>_fbpp+N3
2. flango+K[view] [source] 2023-07-15 15:18:28
>>6gvONx+(OP)
Microsoft is gambling on the hope that model training will be ruled fair use. This makes that outcome seem unlikely.
replies(1): >>brooks+62
3. civili+z1[view] [source] 2023-07-15 15:24:11
>>6gvONx+(OP)
That’s not at all clear to me. IANAL, but first of all, it’s a balancing test, not a bright-line test. The judge could focus on any one factor and make an argument for either side quite easily.

Second, “use” here could mean one of two things: training or inference. It’s publishing the results of inference that can lead to actual effects on the market, not the training.

At the end of the day, someone has to prove tangible harm.

4. brooks+62[view] [source] [discussion] 2023-07-15 15:26:59
>>flango+K
Do you think a human learning something from reading is fair use? Or are we all copyright violators because reading that article altered our connectomes, and we may recall parts of it later?
replies(2): >>ethanb+13 >>snicke+8a1
5. amluto+T2[view] [source] 2023-07-15 15:32:21
>>6gvONx+(OP)
I would look at #1 here. Crawling the Internet to collect information is one thing. (And people putting text on the web without requiring authentication seem to be granting at least some kind of license to anyone who sends a GET request.) But crawling the Internet (via centralized robots or users’ browsers), then storing that data and charging money to others for rights to that data (as Brave seems to be doing, quite explicitly) seems like it deserves a very different evaluation under factor #1.
6. ethanb+13[view] [source] [discussion] 2023-07-15 15:33:13
>>brooks+62
The point being raised is quite specific. Not sure if you’re willfully ignoring it or what?

The answer is no, because you reading the article didn’t dramatically degrade its market value.

An AI ingesting all content on the internet and then being ultra-effective at frontrunning that content for a large number of future readers does degrade its market value (and subsumes it into the model’s value).

replies(3): >>ivalm+A3 >>cma+Iy >>everfo+iF
7. beware+a3[view] [source] 2023-07-15 15:34:11
>>6gvONx+(OP)
Unpopular opinion time:

An ML model is clearly a derivative work of its input.

Here's what I think would be fair:

Anyone who holds copyright in something used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to use accounting tricks to make profits disappear).

In the case of intermediaries (e.g., social media like reddit & twitter) those intermediaries could take a cut before passing it on to the original authors.

Obviously hellishly difficult to administer so it's unlikely to happen but I don't see a better answer.

replies(2): >>mattbe+i7 >>Dylan1+Yu
8. ivalm+A3[view] [source] [discussion] 2023-07-15 15:36:51
>>ethanb+13
I disagree. People learning how to draw also degrades the future value of copyrighted work. Imagine a future where nobody was allowed to learn to draw: the value of existing copyrights would skyrocket!
replies(1): >>llamai+x8
9. _fbpp+N3[view] [source] 2023-07-15 15:38:41
>>6gvONx+(OP)
The entire fair use claim is derived not from any legal basis, but rather from the insistence that “it has to be fair use”, because it would be legally catastrophic for OpenAI et al if it weren’t true.

If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.

For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known. Plenty can be found on HN. The model doesn’t generate a new valid Windows serial key on the fly; it has memorized them.

If one wants to be cynical, it’s not hard to see OpenAI et al patching in filters to remove copyrighted content from the output precisely because having the model spit out copyrighted content would be legally catastrophic for their “fair use” claim. Such output is both copyright infringement in itself and evidence that, no matter how the internals of these models work, they store some of the training data anyway.

replies(2): >>twoodf+Jc >>kmeist+1v
10. mattbe+i7[view] [source] [discussion] 2023-07-15 15:54:19
>>beware+a3
I don't know what a fair settlement would be but I'm looking forward to a copyright-holder suing OpenAI to obtain one. These companies have no value if copyright can be enforced on their training data.
replies(1): >>visarg+Yn
11. llamai+x8[view] [source] [discussion] 2023-07-15 16:02:06
>>ivalm+A3
Arguments like this are great for getting your side to go "rah rah got 'em" and really, really bad for convincing anyone else.

Legal judgments generally focus on actual impacts rather than quirks that might exist in hypothetical universes.

replies(1): >>tharku+2c
12. tharku+2c[view] [source] [discussion] 2023-07-15 16:19:25
>>llamai+x8
While that may be what your parent intended, I’m not entirely sure, and there is a philosophical-level discussion to be had here. Or a market-economics-level one, I guess.

If the pool of people who can learn about topic X is restricted, the outputs of their labor are more expensive. Now lift a continent of billions of people out of poverty, get them access to schooling, safety, etc., and watch market forces do the rest.

Now equate ChatGPT et al with those billions of people, except that it runs on electricity. If the quality is good enough, of course, which is hard to judge right now because of the hype.

replies(1): >>ivalm+Bd1
13. twoodf+Jc[view] [source] [discussion] 2023-07-15 16:22:48
>>_fbpp+N3
It actually doesn’t even matter if LLMs reproduce copyrighted data from their training. The issue is that a human copied the data from its source into memory for use in training, and this copy was likely not fair use under cases like MAI Systems.

The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against the Andy Warhol Foundation over Warhol’s copying of a photograph of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.

I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.

https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...

replies(2): >>gyudin+Op >>luma+SD
14. visarg+Yn[view] [source] [discussion] 2023-07-15 17:20:16
>>mattbe+i7
I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, or summarising, or turning it into question-answer pairs. In this new format it can serve as training data for a clean LLM. Of course the public domain data would be used directly, no need to go synthetic there.

An important direction would be to train copyright attribution models, and diff-models to detect when a work is infringing on another, by direct comparison. They would be useful to filter both the training set and the model outputs.
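
A crude concrete baseline for that kind of diff check is plain n-gram overlap. A minimal sketch, illustrative only and not anything any vendor is known to ship; the 8-gram size and the 0.5 threshold are arbitrary assumptions:

    # Naive "diff" between two works: what fraction of the candidate's
    # word 8-grams appear verbatim in the original? A trained
    # attribution model would do far better; this is the baseline.
    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap_score(candidate, original, n=8):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        return len(cand & ngrams(original, n)) / len(cand)

    # e.g. flag for human review if overlap_score(output, work) > 0.5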

replies(1): >>mattbe+RD
15. gyudin+Op[view] [source] [discussion] 2023-07-15 17:31:16
>>twoodf+Jc
So how is that supposed to work when people send it legally obtained copyrighted materials for analysis?
replies(1): >>twoodf+Ay
16. Dylan1+Yu[view] [source] [discussion] 2023-07-15 18:05:50
>>beware+a3
> An ML model is clearly a derivative work of its input.

Do you mean this in a copying sense or a mathematical sense?

What if it's only storing 1 byte per input document?

17. kmeist+1v[view] [source] [discussion] 2023-07-15 18:06:35
>>_fbpp+N3
OpenAI's bias research on DALL-E revealed that most examples of regurgitation come from repeated copies of the same image in the training set. When they filtered out duplicates, DALL-E stopped drawing training examples.

The problem is that filtering the training set is naively O(n^2) and n is already extremely large for DALL-E. For LLMs, it's comically huge, plus now you have to do substring search. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.
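
To make that cost concrete, the naive approach compares every item against everything already kept, which is O(n^2) similarity checks in the worst case. A sketch, where `similar` is a stand-in for whatever image or text similarity measure you use:

    # Naive near-duplicate filtering: O(n^2) pairwise comparisons.
    def naive_dedup(items, similar, threshold=0.9):
        kept = []
        for x in items:
            if all(similar(x, y) < threshold for y in kept):
                kept.append(x)
        return kept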

As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.

My money's still on "training is fair use", but that actually doesn't help OpenAI all that much either, because fair use is not transitive. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model is fair use. Novel outputs aren't entirely safe, though. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

This logic would also apply in the EU. Last I checked the TDM exception only said training is legal, not that you could sell the outputs. They don't really respect jurisprudence the way the Anglosphere obsesses over "precedent", so copyright exceptions are almost always decided by legislatures and not judges over there, and the likelihood of a judge saying that all outputs are derivative works of the training set regardless of regurgitation is higher.

[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.

Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.

[1] If the opinion was worded poorly this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.

[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.

replies(2): >>richk4+YE >>6gvONx+k11
18. twoodf+Ay[view] [source] [discussion] 2023-07-15 18:34:29
>>gyudin+Op
That copy (the “send”) would be evaluated under the same fair use criteria.

“Write a review of this short story: …” – probably fine.

“Rewrite this short story to have a happier ending: …” – probably not.

19. cma+Iy[view] [source] [discussion] 2023-07-15 18:35:38
>>ethanb+13
> The answer is no, because you reading the article didn’t dramatically degrade its market value.

How about if you read a news article to write a competing one, rewording and possibly citing it (one of the most common practices in news)?

replies(1): >>ethanb+1T
20. mattbe+RD[view] [source] [discussion] 2023-07-15 19:14:11
>>visarg+Yn
Would automated paraphrasing not be a derivative work of the original?
replies(1): >>visarg+E32
21. luma+SD[view] [source] [discussion] 2023-07-15 19:14:19
>>twoodf+Jc
How does that align with Google Books scanning libraries full of copyrighted text, offering full reproductions of sections of the work, and then having the supreme court declare it all to be Fair Use? I think that is a far more relevant precedent here: https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....
replies(2): >>twoodf+TE >>6gvONx+711
22. twoodf+TE[view] [source] [discussion] 2023-07-15 19:21:28
>>luma+SD
The Supreme Court declined to hear the case on appeal, which is a shade different from endorsing the decision after a hearing.

That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.

replies(1): >>luma+SW
23. richk4+YE[view] [source] [discussion] 2023-07-15 19:22:07
>>kmeist+1v
> A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

A judge can’t “commit” the Butlerian Jihad. A jihad is a mass event caused by some fraction of the population believing in some cause.

Which kinda gets to a point that seems to be missed. Copyright law is not “intrinsic” - nobody thinks that copyright is a natural law - it is just a pragmatic implementation which balances various public and private goods. If the world changes such that the law no longer does a good job of balancing the various goods, then either the law will get changed or people will ignore the law.

replies(1): >>kmeist+lO
24. everfo+iF[view] [source] [discussion] 2023-07-15 19:24:32
>>ethanb+13
This applies to so many things, though.

The most obvious parallel to me is YouTube. There are a ton of people ingesting books, then transforming that information into a roughly paraphrased video for people to watch for free (ish). That devalues the books they read and paraphrased, because other people don't need to read them.

Spark Notes devalue actual books in a way, because a lot of high schoolers read those instead of buying the actual book.

Search engines have also supplanted books in large part, because I don't need a whole book to answer a specific question. I don't know anyone that owns an encyclopedia anymore.

This is the next iteration of these processes. Non-novel information's market value has been degrading for decades now. A series of questions that would have cost thousands of dollars in books to answer in the 70's/80's is now free, with or without AI.

replies(1): >>llamai+TS
25. kmeist+lO[view] [source] [discussion] 2023-07-15 20:31:10
>>richk4+YE
Copyright is a unique case in which the law represents a bargain struck in the 1970s that hasn't been updated since. Everyone ignores it because it's nearly impossible to actually enforce copyright on individual infringers. But that doesn't mean copyright is meaningless: any activity which is large enough to be legible[0] to the state will be forced to bend itself to fit within the copyright bargain.

And AI training is extremely legible. This is not like a bunch of people downloading stuff off BitTorrent. All of the large foundation models we use were trained by a large corporation with a source of venture capital funding which could be easily shut off by a sufficiently motivated government. Weights-available and liberally licensed models exist, but most improvements on them are fine-tuning. Anonymous individuals can fine-tune an LLM or art generator with a small amount of data and compute, but they cannot make meaningful improvements on the state of the art.

So our sufficiently motivated copyright judge could at least effectively freeze AI art in time until Big Tech and the MAFIAA agree on how to properly split the proceeds from screwing over individual artists.

"Butlerian Jihad" is a term from a book, so you don't need to take "jihad" literally. However, I will point out that there is a significant fraction of the population that does want to see AI permanently banned from creative endeavors. The loss of ownership over their work from having it be in the training set is a factor, but their main argument is that they specifically want to keep their current jobs as they are. They do not want to be replaced with AI, nor do they want to replace their existing drawing work with SEO keyword stuffed text-to-image prompts.

[0] https://en.wikipedia.org/wiki/Seeing_Like_a_State

replies(1): >>richk4+lD1
26. llamai+TS[view] [source] [discussion] 2023-07-15 21:05:53
>>everfo+iF
LLMs are attracting so much positive attention because they are likely to be a huge, huge step-change improvement over all those methods you mention.

For that same exact reason, it’s totally reasonable they’re attracting unique amounts of negative attention too.

You can’t have it both ways: “yes, LLMs are going to change information retrieval the way nothing else has before,” but also “no, they’re just like all the other things in terms of their impact on incentive structures.”

FWIW I don’t really know where I land on this issue. I just find it totally incoherent to believe in the bull case of “this will transform everything” while also portraying it all as par for the course when discussing potential negatives.

Just because Spark Notes didn’t obviously manage to kill valuable parts of our information ecosystem and economy does not mean that Spark Notes x 10,000,000 will not.

27. ethanb+1T[view] [source] [discussion] 2023-07-15 21:07:19
>>cma+Iy
How about it? Do you not think it incurs a lot of negative effects?
replies(1): >>cma+Cc7
28. luma+SW[view] [source] [discussion] 2023-07-15 21:38:29
>>twoodf+TE
OpenAI is also providing access to limited extracts. Google wasn’t selling this over an API; they were providing “free” access to it while displaying ads to the user. Would the courts see this manner of monetization as different enough that settled case law wouldn’t apply?
replies(1): >>twoodf+r91
29. 6gvONx+711[view] [source] [discussion] 2023-07-15 22:18:37
>>luma+SD
Google also bought copies of each book, I believe, which makes it another step removed from standard ML practice.
30. 6gvONx+k11[view] [source] [discussion] 2023-07-15 22:20:19
>>kmeist+1v
> The problem is that filtering the training set is naively O(n^2)

There are standard ways to do it that are O(n), FYI.
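
For exact duplicates, the standard trick is to hash each item once and bucket on the hash, so duplicates collide instead of being compared pairwise; near-duplicates use locality-sensitive hashes like MinHash the same way. A minimal sketch of the exact case:

    import hashlib

    # O(n) exact dedup: one hash and one set lookup per item.
    # Near-dup variants swap sha256 for MinHash/LSH signatures.
    def dedup_exact(items):
        seen, kept = set(), []
        for x in items:
            h = hashlib.sha256(x.encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(x)
        return kept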

31. twoodf+r91[view] [source] [discussion] 2023-07-15 23:45:45
>>luma+SW
OpenAI isn’t doing anything like what Google was doing with Books. It’s not hard for laymen to see that, and it’s going to be obvious to any judge who hears a case.

Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously be creating a derivative work and outside fair use bounds. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.

Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.

32. snicke+8a1[view] [source] [discussion] 2023-07-15 23:53:09
>>brooks+62
Yes, it is considered fair use, but it’s also completely irrelevant, because we’re talking about a computer program, not a person.
33. ivalm+Bd1[view] [source] [discussion] 2023-07-16 00:29:27
>>tharku+2c
Your sentiment is exactly what I intended, albeit I was terse and a little facetious. ChatGPT is like introducing a bunch of new skilled labor; it’s just that, for the first time, this skilled labor isn’t human. Objecting that this skilled labor learned from copyrighted material is like objecting that human labor learned from copyrighted material.
34. richk4+lD1[view] [source] [discussion] 2023-07-16 05:00:01
>>kmeist+lO
Butlerian jihad is a good reference point. Something so bad happened that a large enough portion of the population was convinced to destroy thinking machines, and this no-computer norm held in human society for a crazy long time (it’s been too long for me to remember how much time elapsed before Chapterhouse, which I think is the book where thinking machines start returning). It was a core belief of humanity that computers were bad, not a law imposed by a judge or legislature.

So say a US judge did impose severe restrictions on LLMs through US copyright law. The giant companies that are using LLMs will just move to another country. And just like tax law, others will be happy to have them. Would the US start blocking inbound internet traffic from countries that don’t have the same interpretation of copyright? That seems very unlikely.

The point is that the only way LLMs get the butlerian jihad treatment is if the people rise up against them. Right now, that is nowhere close to happening.

35. visarg+E32[view] [source] [discussion] 2023-07-16 10:45:39
>>mattbe+RD
So you think any paraphrase of a copyrighted phrase is a copyright violation? That’s like owning the idea itself. Is any utterance similar to this one now forbidden?
replies(1): >>mattbe+xm3
36. mattbe+xm3[view] [source] [discussion] 2023-07-16 19:15:24
>>visarg+E32
I think if you automate paraphrasing from an original work to use that original work in bulk somehow, yes.

How do you even automate paraphrasing without training it on lots of original work? It's infringement all the way down.

37. cma+Cc7[view] [source] [discussion] 2023-07-17 22:57:28
>>ethanb+1T
What's the alternative? First-to-publish exclusivity on each news event, regardless of quality? No synthesizing multiple stories into a linked narrative?