But it IS possible to train a model for that. In fact, I believe ML models can be fantastic "code archaeologists", giving us insights into not just direct copying, but inspiration and idioms as well. They don't just have the code, they have commit histories with timestamps.
A causal fact which these models could incorporate, is that we know data from the past wasn't influenced by data from the future. I believe that is a lever to pry open a lot of wondrous discoveries, and I can't wait until a model with this causal assumption is let loose on Spotify's catalog, and we get a computer's perspective on who influenced who.
But in the meantime, discovering where copy-pasted code originated should be a lot easier.
It would be much more valuable for people who care about the truth.