The core argument in favour of fair use is that "LLMs do not copy the training data" - yet this is obviously false.
For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known; plenty can be found on HN. ChatGPT doesn't generate a new valid Windows serial key on the fly - it has memorized them.
If one wants to be cynical, it's not hard to see OpenAI et al. patching in filters to strip copyrighted content from the output precisely because having the model spit out copyrighted content would be legally catastrophic for their "fair use" claim: it is both copyright infringement in itself, and evidence that, no matter how the internals of these models work, they store some of the training data anyway.
The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against the Andy Warhol Foundation over Warhol’s copying of a photograph of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.
I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: you opt in to be included, and you collect fees from OpenAI when they use what you added. This isn’t unprecedented: Congress has repeatedly set up special rules and processes for things like music recordings over the years.
https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...
The problem is that filtering the training set is naively O(n^2) - you compare every item against every other - and n is already extremely large for DALL-E. For LLMs, it's comically huge, and on top of that you have to do substring search, since a copied passage can be buried inside an otherwise-distinct document. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.
As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.
My money's still on "training is fair use", but that actually doesn't help OpenAI all that much, because fair use is not transitive: a fair-use model does not automatically make its outputs fair use. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model is fair use. Novel outputs aren't entirely safe either. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].
This logic would also apply in the EU. Last I checked, the TDM exception only says training is legal, not that you can sell the outputs. Civil-law systems don't treat jurisprudence the way the Anglosphere obsesses over "precedent" - copyright exceptions over there are almost always carved out by legislatures, not judges - so absent an explicit exception for outputs, the likelihood of a judge saying that all outputs are derivative works of the training set, regardless of regurgitation, is higher.
[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.
Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.
[1] If the opinion was worded poorly this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.
[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.
“Write a review of this short story: …” – probably fine.
“Rewrite this short story to have a happier ending: …” – probably not.
That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.
A judge can’t “commit” the Butlerian Jihad. A jihad is a mass event driven by some fraction of the population believing in a cause.
Which kinda gets to a point that seems to be missed. Copyright law is not “intrinsic” - nobody thinks that copyright is a natural law - it is just a pragmatic implementation which balances various public and private goods. If the world changes such that the law no longer does a good job of balancing the various goods, then either the law will get changed or people will ignore the law.
And AI training is extremely legible. This is not like a bunch of people downloading stuff off BitTorrent. All of the large foundation models we use were trained by a large corporation with a source of venture capital funding which could be easily shut off by a sufficiently motivated government. Weights-available and liberally licensed models exist, but most improvements on them are fine-tuning. Anonymous individuals can fine-tune an LLM or art generator with a small amount of data and compute, but they cannot make meaningful improvements on the state of the art.
So our sufficiently motivated copyright judge could at least effectively freeze AI art in time until Big Tech and the MAFIAA[2] agree on how to properly split the proceeds from screwing over individual artists.
"Butlerian Jihad" is a term from a book, so you don't need to take "jihad" literally. However, I will point out that there is a significant fraction of the population that does want to see AI permanently banned from creative endeavors. The loss of ownership over their work from having it be in the training set is a factor, but their main argument is that they specifically want to keep their current jobs as they are. They do not want to be replaced with AI, nor do they want to replace their existing drawing work with SEO keyword stuffed text-to-image prompts.
There are standard ways to do that filtering in O(n), FYI: hash-based, not pairwise.
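To make that concrete, here's a minimal sketch of the hash-based approach in Python. The shingle size (50 characters) and the choice of SHA-1 are illustrative assumptions on my part, not anything OpenAI has described; real pipelines use fancier variants (MinHash/LSH for near-duplicates, suffix arrays for exact substring matching). The point is one linear pass over the corpus instead of comparing every pair of documents:

    import hashlib

    def shingle_hashes(text, k=50):
        # Hash every overlapping k-character window of a document.
        # Cost is linear in the document length. k=50 is an arbitrary
        # illustrative choice.
        for i in range(max(1, len(text) - k + 1)):
            window = text[i:i + k]
            yield hashlib.sha1(window.encode("utf-8")).hexdigest()

    def find_shared_spans(corpus, k=50):
        # Single pass: map each window hash to the set of documents
        # containing it, then keep hashes seen in two or more documents.
        # No document is ever compared directly against another.
        index = {}
        for doc_id, text in enumerate(corpus):
            for h in shingle_hashes(text, k):
                index.setdefault(h, set()).add(doc_id)
        return {h: docs for h, docs in index.items() if len(docs) > 1}

    # Toy corpus: documents 0 and 2 share a long verbatim passage.
    corpus = [
        "the quick brown fox jumps over the lazy dog " * 3,
        "a completely unrelated document about something else",
        "new intro, then: " + "the quick brown fox jumps over the lazy dog " * 3,
    ]
    shared = find_shared_spans(corpus)
    print(len(shared), "repeated 50-char spans, across docs",
          sorted(next(iter(shared.values()))))

This catches verbatim substring copies; near-duplicates (light paraphrases) need MinHash-style sketching on top, but none of it requires the O(n^2) all-pairs comparison.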
Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously be creating a derivative work, well outside the bounds of fair use. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.
Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.
So say a US judge did impose severe restrictions on LLMs through US copyright law. The giant companies that are using LLMs will just move to another country. And just like tax law, others will be happy to have them. Would the US start blocking inbound internet traffic from countries that don’t have the same interpretation of copyright? That seems very unlikely.
The point is that the only way LLMs get the Butlerian Jihad treatment is if the people rise up against them. Right now, that is nowhere close to happening.