1. jakein (OP) 2023-12-27 16:47:51
I don't see a judge ruling that training a model on copyrighted works is infringement; I think (hope) that training gets ruled protected fair use. It's the LLM's output behaviour, specifically the model's willingness to reproduce text verbatim, that is clearly a copyright violation and should rightfully result in royalties being paid out.

That also seems technically feasible to filter out or cite, though at serious cost, in both compute and user-facing latency. Verbatim text should be easy to identify, although it may require a Google-Search-level amount of indexing and compute.

Summaries and text "in the style of" the NYT or others are the tricky part. I'm not sure there's any high-precision way to identify those on the output side of an LLM, though I can imagine a GAN-style discriminator trained to flag them (erring on the side of false positives). Filtering out suspiciously infringe-ish outputs and re-running inference seems much more solvable than perfect citations for non-verbatim output; a rough sketch of that loop is below.
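To make the verbatim case concrete, here's a minimal sketch of the filter-and-retry idea, assuming a shingle index over the protected corpus. Everything here is a placeholder: the `llm` callable, the 8-token shingle size, and the 0.2 overlap threshold are illustrative assumptions, not anything a real deployment is known to use.

```python
import hashlib

NGRAM = 8  # shingle length in tokens; a real system would tune this


def shingles(text: str, n: int = NGRAM):
    """Yield a hash for every n-token window of the text."""
    tokens = text.split()
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i : i + n])
        yield hashlib.sha1(window.encode()).hexdigest()


def build_index(corpus: list[str]) -> set[str]:
    """Hash every n-gram of the protected corpus into a set."""
    index = set()
    for doc in corpus:
        index.update(shingles(doc))
    return index


def overlap_ratio(output: str, index: set[str]) -> float:
    """Fraction of the output's n-grams that appear in the corpus."""
    hashes = list(shingles(output))
    if not hashes:
        return 0.0
    return sum(h in index for h in hashes) / len(hashes)


def generate_checked(prompt: str, llm, index: set[str],
                     threshold: float = 0.2, max_tries: int = 3) -> str:
    """Filter-and-retry loop: reject suspiciously verbatim outputs
    and re-sample, refusing after max_tries attempts."""
    for _ in range(max_tries):
        candidate = llm(prompt)  # llm is a hypothetical text-generation callable
        if overlap_ratio(candidate, index) < threshold:
            return candidate
    return "[output withheld: too close to indexed copyrighted text]"
```

At web scale the plain set would have to become a Bloom filter or a sharded index, which is where the Google-Search-level indexing cost comes in; the extra lookup and occasional re-sampling are the compute and latency costs mentioned above.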