An important direction would be to train copyright attribution models and diff-models that detect, by direct comparison, when one work infringes on another. They would be useful for filtering both the training set and the model outputs.
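To make "direct comparison" concrete, here is a rough sketch of the filtering step. It is not a trained diff-model: it swaps in plain word n-gram overlap as a stand-in, and the names `shingles`, `overlap`, `filter_texts`, and the 0.5 threshold are all made up for illustration. High surface overlap is only a crude signal and says nothing about whether something legally counts as infringement.

```python
from __future__ import annotations


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = shingles(candidate, n)
    if not cand:
        # Too short to compare; treat as no overlap (an assumption).
        return 0.0
    return len(cand & shingles(reference, n)) / len(cand)


def filter_texts(candidates: list[str], references: list[str],
                 threshold: float = 0.5, n: int = 3) -> list[str]:
    """Keep only candidates whose overlap with every reference is below threshold."""
    return [
        c for c in candidates
        if all(overlap(c, r, n) < threshold for r in references)
    ]
```

The same `filter_texts` call could in principle be pointed at training documents or at model outputs, with the protected works as `references`. A learned model would replace `overlap` with something that also catches paraphrase, which this sketch does not.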
How do you even automate paraphrasing without training it on lots of original work? It's infringement all the way down.