> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
> 2) The nature of the copyrighted work
> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole
> 4) The effect of the use upon the potential market for or value of the copyrighted work
[emphasis from TFA]
HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.
Regardless, it makes it seem much less clear cut than people here often say.
If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.
For GitHub Copilot and ChatGPT, examples of the model reciting large sections of training data are well known. Plenty can be found on HN. It doesn't generate a new valid Windows serial key on the fly; it has memorized them.
If one wants to be cynical, it's not hard to see OpenAI/etc patching in filters to remove copyrighted content from the output precisely because it's legally catastrophic for their "fair use" claim to have the model spit out copyrighted content. Such output is both copyright infringement in itself and evidence that, no matter how the internals of these models work, they store some of the training data anyway.
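To make the "filter" idea concrete, here is a rough sketch of what an output-side check for verbatim recitation could look like. This is purely illustrative: nothing here is known to be what OpenAI or GitHub actually do, and the function names, the whitespace tokenization, the 8-word window, and the 0.5 threshold are all assumptions picked for the example.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, reference, n=8):
    """Fraction of the output's n-grams that appear verbatim in the reference.

    Tokenization is a naive whitespace split (an assumption for this sketch).
    Returns 0.0 when the output is shorter than n tokens.
    """
    out_grams = ngrams(output.split(), n)
    if not out_grams:
        return 0.0
    ref_grams = ngrams(reference.split(), n)
    return len(out_grams & ref_grams) / len(out_grams)


def should_block(output, references, n=8, threshold=0.5):
    """Hypothetical policy: block if any reference text overlaps too heavily."""
    return any(verbatim_overlap(output, ref, n) >= threshold for ref in references)
```

Note that a check like this only hides recitation at the output; it says nothing about whether the text is still stored in the weights, which is the point being argued above.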
The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against the Andy Warhol Foundation over Warhol’s use of a photograph of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.
I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: you opt in to be included, and you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress has repeatedly set up special rules and processes for things like music recordings over the years.
https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...
That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.
Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously create a derivative work and fall outside the bounds of fair use. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.
Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.