a) In many closely comparable scenarios, yes, it’s copyright infringement. When Francis Ford Coppola made The Godfather film, he couldn’t just be “inspired” by Puzo’s book. If the story or characters or dialog are similar enough, he had to pay Puzo, even though the work he created was quite different and not a literal “copy”.
b) Training an LLM isn’t like giving someone a book. Among other things, it involves copying the work into GPU memory. That copy is not a transitory copy in service of a fair use, nor likely a fair use in itself, nor licensed by the rights-holder.
Training is almost certainly fair use, so it is exactly a transitory copy in service of fair use. Aside from the brief “transitory copy” you mention, training is not copying; it’s making a minuscule algorithmic adjustment based on fleeting exposure to the data.
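To put that in concrete terms, here’s a minimal toy sketch in Python (a hypothetical single-weight “model”, nothing like a real LLM) of what one training step amounts to: the text is read once, a tiny numeric nudge to the weight is computed from it, and then the text is gone; only the adjusted number remains.

    import random

    # Toy "language model": predicts the next character code from the
    # current one via a single weight. Purely illustrative.
    w = random.uniform(-0.01, 0.01)

    def train_step(text, lr=1e-6):
        global w
        # "Fleeting exposure": walk the text once, compute a gradient of a
        # squared-error loss, and nudge the weight by a tiny amount.
        for prev, nxt in zip(text, text[1:]):
            x, y = ord(prev), ord(nxt)
            pred = w * x
            grad = 2 * (pred - y) * x   # d(loss)/dw
            w -= lr * grad              # the minuscule algorithmic adjustment
        # the text goes out of scope here; only the adjusted weight w remains

    train_step("Some sentence from the training corpus.")
    print(w)  # a single float: no copy of the original text survives in the model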
Congress took the circuit holding in MAI Systems seriously enough to carve out a new fair use exception for copying software (entirely within the memory system of a licensed user) in service of debugging it.
If it took an act of Congress to make “unlicensed” debugging a fair use copy…
Seems extremely transitory, and since the output cannot be copyrighted, it does no harm to any work it “trained” on.
I don't think you can copyright a plot or story in any country, can you?
If he had rewritten the story with different characters and different lines, he wouldn't have had to pay Puzo. I'm sure it would have been frowned upon if it were too close, but it would be legally OK.
If Microsoft truly believes that a trained model's output doesn't violate copyright, then it should be forced to prove that by training on all of its internal source code, including Windows.
If it disgorges parts of NYT articles, how do we know it isn't just a common phrase, or that the article isn't reproduced verbatim on another, non-paywalled site?
I agree that if the whole content of their articles was used for training, then the NYT should get paid, but I'm not sure they specifically trained on "paid NYT articles" as a category, though I'm happy to be corrected.
I also think that companies and authors vastly overvalue the tiny fragments of their work in the huge pool of training data; there's a bit of a "main character" vibe going on.