zlacker

[parent] [thread] 3 comments
1. visarg+(OP)[view] [source] 2023-07-15 17:20:16
I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, summarising it, or turning it into question-answer pairs. In this new format it could serve as training data for a clean LLM. Of course the public domain data would be used directly, no need to go synthetic there.
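As a toy sketch of that routing step (the `paraphrase_with_llm` helper is hypothetical, standing in for a real model call):

```python
# Sketch of the replacement-data idea: route each document through a
# transform before it reaches the training set; public-domain text
# passes through unchanged. `paraphrase_with_llm` is a placeholder,
# not a real API.

def paraphrase_with_llm(text: str) -> str:
    # A real pipeline would call a paraphrasing/summarising model here.
    return f"[paraphrased] {text}"

def build_training_record(doc: dict) -> dict:
    if doc["license"] == "public-domain":
        return {"text": doc["text"], "provenance": "direct"}
    return {"text": paraphrase_with_llm(doc["text"]), "provenance": "synthetic"}

docs = [
    {"text": "Call me Ishmael.", "license": "public-domain"},
    {"text": "Some copyrighted passage.", "license": "all-rights-reserved"},
]
records = [build_training_record(d) for d in docs]
```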

An important direction would be to train copyright attribution models, and diff-models that detect when one work infringes on another by direct comparison. Both would be useful for filtering the training set and the model outputs.
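The crudest possible instance of such a diff check is n-gram overlap: flag a candidate when it shares long verbatim runs with a source work. A minimal sketch (the 5-gram size and any threshold you would apply are arbitrary illustrations; a real diff-model would use fuzzier matching):

```python
# Flag near-verbatim copying via word 5-gram Jaccard similarity.
# This only catches literal reuse, not paraphrase, which is exactly
# why trained diff-models would be needed on top of it.

def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(candidate: str, source: str, n: int = 5) -> float:
    a, b = ngrams(candidate, n), ngrams(source, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)  # Jaccard similarity in [0, 1]

source = "the quick brown fox jumps over the lazy dog near the river"
verbatim = source
reworded = "a fast brown fox leaps over a sleepy dog by the water"
```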

replies(1): >>mattbe+Tf
2. mattbe+Tf[view] [source] 2023-07-15 19:14:11
>>visarg+(OP)
Would automated paraphrasing not be a derivative work of the original?
replies(1): >>visarg+GF1
3. visarg+GF1[view] [source] [discussion] 2023-07-16 10:45:39
>>mattbe+Tf
So you think any paraphrase of a copyrighted phrase is a copyright violation? That's like owning the idea itself. Is any utterance similar to this one now forbidden?
replies(1): >>mattbe+zY2
4. mattbe+zY2[view] [source] [discussion] 2023-07-16 19:15:24
>>visarg+GF1
I think if you automate paraphrasing of an original work in order to use that original work in bulk somehow, yes.

How do you even automate paraphrasing without training it on lots of original work? It's infringement all the way down.
