> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
> 2) The nature of the copyrighted work
> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole
> 4) The effect of the use upon the potential market for or value of the copyrighted work
[emphasis from TFA]
HN always talks about derivative works and transformativeness, but never about these factors. The fourth one especially seems to have clear implications for models.
Regardless, they make the question look much less clear-cut than people here often claim.
If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.
For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known; plenty can be found on HN. ChatGPT doesn't generate a new valid Windows serial key on the fly: it has memorized them.
If one wants to be cynical, it's not hard to see OpenAI et al. patching in filters to remove copyrighted content from the output precisely because it would be legally catastrophic for their "fair use" claim to have the model spit out copyrighted material. Such output is both copyright infringement in itself and evidence that, no matter how the internals of these models work, they store some of the training data anyway.
The problem is that filtering the training set is naively O(n^2), and n is already extremely large for DALL-E. For LLMs it's comically huge, and on top of that you now need substring search rather than exact matching. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.
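(For the curious: the standard way around the O(n^2) blowup, at least for whole-document duplicates, is locality-sensitive hashing: compute a fixed-size MinHash signature per document and bucket the signatures by bands, so only near-identical documents are ever compared directly. Here's a minimal sketch in Python; it's purely illustrative, the names are mine, and it says nothing about OpenAI's actual pipeline. Note it catches near-duplicate documents, not arbitrary shared substrings, which is exactly why the LLM case is harder.)

```python
import hashlib
from collections import defaultdict

def shingles(text, k=8):
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """MinHash: for each seeded hash function, keep the minimum hash value
    over all shingles. Similar shingle sets yield similar signatures."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8,
                                           salt=salt).digest(), "big")
            for s in shingle_set))
    return sig

def near_duplicate_pairs(docs, bands=16, rows=4):
    """LSH banding: split each signature into bands; two documents sharing
    any identical band become a candidate pair. Cost stays roughly linear
    in the number of documents, instead of comparing all O(n^2) pairs."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes=bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

corpus = {
    "a": "the quick brown fox jumps over the lazy dog again and again",
    "b": "the quick brown fox jumps over the lazy dog again and again!!",
    "c": "an entirely different document about copyright and fair use",
}
# "a" and "b" should almost certainly surface as a candidate pair; "c" should not.
print(near_duplicate_pairs(corpus))
```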
As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.
My money's still on "training is fair use", but even that wouldn't help OpenAI all that much, because fair use is not transitive. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model itself is fair use. Even novel outputs aren't entirely safe. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].
This logic would also apply in the EU. Last I checked, the TDM exception only says that training is legal, not that you can sell the outputs. Civil law countries don't treat jurisprudence the way the Anglosphere obsesses over "precedent": copyright exceptions over there are almost always decided by legislatures rather than judges, and the likelihood of a judge saying that all outputs are derivative works of the training set, regardless of regurgitation, is higher.
[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.
Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.
[1] If the opinion were worded poorly, this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.
[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.