An ML model is clearly a derivative work of its inputs.
Here's what I think would be fair:
Anyone who holds copyright in a work used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to make profits disappear with accounting tricks.)
In the case of intermediaries (e.g., social media like reddit & twitter), the intermediary could take a cut before passing the rest on to the original authors; a rough sketch of the arithmetic follows below.
Obviously this would be hellishly difficult to administer, so it's unlikely to happen, but I don't see a better answer.
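To make the split concrete, here's a minimal sketch, assuming per-work contribution weights and intermediary cut rates are somehow already known (producing those weights is the genuinely hard part the scheme hand-waves over):

    # Sketch: split one period's model cash flow across corpus contributors.
    # All weights and cut rates below are hypothetical inputs.

    def split_cash_flow(cash_flow, works):
        """works: list of dicts with 'author', 'weight', and optional
        'intermediary_cut' (fraction kept by e.g. reddit/twitter)."""
        total_weight = sum(w["weight"] for w in works)
        payouts = {}
        for w in works:
            gross = cash_flow * w["weight"] / total_weight
            net = gross * (1.0 - w.get("intermediary_cut", 0.0))
            payouts[w["author"]] = payouts.get(w["author"], 0.0) + net
        return payouts

    # Example: $1M of cash flow, three works, one reached via an intermediary.
    print(split_cash_flow(1_000_000, [
        {"author": "alice", "weight": 2.0},
        {"author": "bob", "weight": 1.0},
        {"author": "carol", "weight": 1.0, "intermediary_cut": 0.30},
    ]))
    # -> {'alice': 500000.0, 'bob': 250000.0, 'carol': 175000.0}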
An important direction would be to train copyright-attribution models, and diff-models that detect when one work infringes on another by direct comparison. Both would be useful for filtering the training set and the model outputs.
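A learned attribution model is beyond the scope of a comment, but as a crude stand-in for the "direct comparison" diff-model, here's a character n-gram overlap filter (the n=8 and 0.5 threshold are arbitrary; a real diff-model would need learned representations to catch paraphrase, which this deliberately does not attempt):

    # Flag near-copies by character n-gram Jaccard overlap.

    def ngrams(text, n=8):
        text = " ".join(text.lower().split())  # normalize case/whitespace
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def jaccard(a, b, n=8):
        ga, gb = ngrams(a, n), ngrams(b, n)
        return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

    def flag_infringing(candidate, corpus, threshold=0.5):
        """Return corpus entries whose overlap with candidate exceeds
        the threshold. Usable on training data or on model outputs."""
        return [doc for doc in corpus if jaccard(candidate, doc) >= threshold]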
Do you mean this in a copying sense or a mathematical sense?
What if it's only storing 1 byte per input document?
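To put rough numbers on that capacity argument (all figures hypothetical, purely illustrative):

    # Back-of-envelope bound on average memorization per document.
    params = 7e9            # parameters in the model (assumed)
    bytes_per_param = 2     # fp16 storage
    documents = 2e10        # documents in the training corpus (assumed)

    bytes_per_doc = params * bytes_per_param / documents
    print(f"{bytes_per_doc:.2f} bytes/document")  # -> 0.70 bytes/document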
How do you even automate paraphrase detection without training it on lots of original work? It's infringement all the way down.