People keep saying this without defining what exactly they mean. This is a technical topic, and it requires technical explanations. What do you think "mostly copying" means when you say it?
Because not a shred of original pixel data is reproduced from training data through to output by any of the diffusion models. In fact, there isn't enough data in the model weights to reproduce any images at all without adding a random noise field.
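To make that concrete, here's a rough sketch of how a DDPM-style sampler generates an image. The noise schedule and the predict_noise() stand-in are toy values of my own, not from any real model, but the structure is the point: generation starts from a pure random noise field, and the weights only parameterise the denoiser that gets applied to it step by step.

```python
import numpy as np

T = 50                                    # number of denoising steps (toy value)
betas = np.linspace(1e-4, 0.02, T)        # toy noise schedule, not a real model's
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t):
    """Stand-in for the trained network eps_theta(x, t)."""
    return np.zeros_like(x)               # a real model returns its noise estimate here

x = np.random.randn(64, 64, 3)            # x_T: the random noise field everything starts from
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # DDPM ancestral sampling step: remove the predicted noise...
    x = (x - (betas[t] / np.sqrt(1.0 - alpha_bars[t])) * eps) / np.sqrt(alphas[t])
    if t > 0:
        # ...and add back a smaller amount of fresh noise
        x += np.sqrt(betas[t]) * np.random.randn(*x.shape)

# x is now the generated "image"; without the initial noise there is nothing to denoise.
```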
> The benefits of allowing this will be had by a very small group of corporations and individuals
You are also grossly mistaken here. The benefits of heavily restricting this will be had by a very small group of corporations and individuals. See, everyone currently comes around to "you should be able to copyright a style" as the solution to the "problem".
Okay - let's game this out. US copyright lasts for the life of the author plus 70 years. No work copyrighted today will enter the public domain until I am dead, my children are dead, and probably my grandchildren as well. But copyright can be traded and sold. And unlike individuals, who do die, corporations as legal entities do not. And corporations can own copyright.
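Just to put rough numbers on that (the dates are hypothetical, obviously):

```python
# Back-of-the-envelope: an author born in 1990, publishing in 2020, dying at 80 in 2070.
year_of_death = 2070
public_domain_year = year_of_death + 70   # life + 70 under current US law
print(public_domain_year)                 # 2140 - roughly four generations from now
```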
What is the probability that any particular artistic "style" - however you might define that (a whole other topic, really) - is truly unique? I mean, people don't generally invent a style on their own - they build it up from studying other sources and come up with a mix. Whatever originality is in there is more a function of mutations in their ability to imitate styles than anything else - art students, for example, regularly do studies of famous artists and intentionally try to copy their style as best they can. A huge amount of content tagged "Van Gogh" in Stable Diffusion is actually Van Gogh look-alikes, or content literally labelled "X in the style of Van Gogh". It had nothing to do with the original man at all.
I mean, zero - judging by that example, it's zero. There are no truly original art styles. Which means that in a world with copyrightable art styles, all art styles eventually end up as part of corporate-owned styles. Or the opposite is also possible - maybe they all end up in the public domain. But in both cases the answer is the same: if "style" becomes a copyrightable thing, and AIs can reproduce it in some way you can prove, then literal "prior art" for any particular style will invariably already be part of an AI dataset. Any new artist with a unique style will invariably be found to be simply a 95% blend of other known styles by an AI which has existed for centuries and has been producing output constantly.
In the public-domain world, we wind up approximately where we are now: every few decades old styles get new names attached to them as people try to keep up with some new rising artist who's captured a unique blend in the zeitgeist. In the corporate world, though - the more likely one - Disney turns up with its lawyers and says "we're taking 70% or we're taking it all".
I disagree that there is no originality in art styles; human creativity amounts to more than just copying other people. There is no way a current-gen AI model would be able to create truly original mathematics or physics; it is just able to reproduce facsimiles and convincing bullshit that looks like it. Before long the models will probably be able to do formal reasoning in a system like Lean 4, but that is a long way off from truly inventive mathematics or physics.
Art is more subtle, but what these models produce is mostly "kitsch". It is telling that their idea of "aesthetics" involves anime fan art and other commercial work. Anyways, I don't like the commercial aspects of copyright all that much, but what I do value is humans over machines. I believe in freely reusing and building on the work of others, but not in machines doing the same. Our interests are simply not aligned at this point.
When AlphaGo adds one of its own self-vs-self games to its training database, it is adding a genuine game. The rules are followed. One side wins. The winning side did something right.
Perhaps the standard of play is low. One side makes some bad moves, the other side makes a fatal blunder, the first side pounces and wins. I was surprised that they got training through self-play to work; in the earlier stages the player who wins is only playing a little better than the player who loses, and it is hard to work out what to learn. But the truth of Go is present in the games and not diluted beyond recovery.
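To spell out what that anchor looks like, here's a toy version of the self-play data-generation loop. The game is a trivial stand-in for Go and the code is my own sketch, nothing to do with DeepMind's, but the shape is the same: the current policy plays itself, and the hard-coded rules - not anyone's opinion - decide who won.

```python
import random

TARGET = 10  # toy game: players alternately add 1-3 to a total, whoever reaches 10 wins

def random_policy(total):
    """Stand-in for the current network: pick a legal move at random."""
    return random.choice([m for m in (1, 2, 3) if total + m <= TARGET])

def self_play_game():
    total, player, history = 0, 0, []
    while total < TARGET:
        move = random_policy(total)
        history.append((player, total, move))   # record (who, state, move)
        total += move
        if total == TARGET:
            winner = player                     # the rules decide this, nothing else
        player = 1 - player
    return history, winner

training_data = []
for _ in range(1000):
    history, winner = self_play_game()
    for player, state, move in history:
        label = 1 if player == winner else 0    # the winner's moves become positive examples
        training_data.append((state, move, label))
```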
But a LLM is playing a post-modern game of intertextuality. It doesn't know that there is a world beyond language to which language sometimes refers. Is what a LLM writes true or false? It is unaware of either possibility. If its own output is added to the training data, that creates a fascinating dynamic. But where does it go? Without AlphaGo's crutch of the "truth" of which player won the game according to the hard-coded rules, I think the dynamics have no anchorage in reality and would drift, first into surrealism and then into psychosis.
One sees that AlphaGo is copying the moves that it was trained on, and a LLM is also copying the moves that it was trained on, and that these two things are not the same.