zlacker

> What's wrong with paying copyright holders, then?

There’s nothing wrong with it. But it would make it vastly more cumbersome to build training sets in the current environment.

If the law permits producers of content to easily add extra clauses to their content licenses that say “an LLM must pay us to train on this content”, you can bet that that practice would be near-universally adopted because everyone wants to be an owner. Almost all content would become AI-unfriendly. Almost every token of fresh training content would now potentially require negotiation, royalty contracts, legal due diligence, etc. It’s not like OpenAI gets their data from a few sources. We’re talking about millions of sources, trillions of tokens, from all over the internet — forums, blogs, random sites, repositories, outlets. If OpenAI were suddenly forced to do a business deal with every source of training data, I think that would frankly kill the whole thing, not just slow it down.

It would be like ordering Google to do a business deal with the webmaster of every site they index. Different business, but the scale of the dilemma is the same. These companies crawl the whole internet.