zlacker

Why shouldn't the creators of the training content get anything for their efforts? With some guiderails in place to establish what is fair compensation, Fair Use can remain as-is.

replies(3): >>apante+G1 >>solard+g4 >>insani+a9

>>strong+(OP)
The issue as I see it is that every bit of data that the model ingested in training has affected what the model _is_ and therefore every token of output from the model has benefited from every token of input. When you receive anything from an LLM, you are essentially receiving a customized digest of all the training data. The second issue is that it takes an enormous amount of training data to train a model. In order to enable users to extract ‘anything’ from the model, the model has to be trained on ‘everything’. So I think these models should be looked at as public goods that consume everything and can produce anything. To have to keep a paper trail on the ‘everything’ part (the input) and send a continuous little trickle of capital to all of the sources is missing the point. That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.

OpenAI isn’t marching into the online news space and posting NY Times content verbatim in an effort to steal market share from the NY Times. OpenAI is in the business of turning ‘everything’ (input tokens) into ‘anything’ (output tokens). If someone manages to extract a preserved chunk of input tokens, that’s more like an interesting edge case of the model. It’s not what the model is in the business of doing.

Edit: typo

replies(1): >>vharuc+BT

>>strong+(OP)
Everyone learns from papers. That's the point of them, isn't it? Except we pay, what, $4 per Sunday paper or $10/mo for the digital edition? Why should a robot have to pay much more just because it's better at absorbing information?

replies(2): >>awwaii+I6 >>layer8+M6

>>solard+g4
That would be a funny settlement -- "OK, so $10/month, and we'll go back to 1950 to be fair, so that'll be .... $8760"

>>solard+g4
Because the issue isn’t the intake, it’s the output, where your analogy breaks down. If you could clone the brain of someone who was “trained” on decades of NYT and could reproduce its information on demand at scale, we’d be discussing similar issues.

replies(1): >>evanda+un

>>strong+(OP)
> Why shouldn't the creators of the training content get anything for their efforts?

Well, they didn't charge for it, right? They're retroactively asking for money, but they could have just locked their content behind a strict paywall or had a specific licensing agreement enforceable ahead of time. They could do that going forward, but how is it fair for them to go back and say that?

And the issue isn't "You didn't pay us" it's "This infringes our copyright", which historically the answer has been "no it doesn't".

>>layer8+M6
Your analogy doesn't make sense either.

If we could clone the brain of someone I hardly think we'd be discussing their vast knowledge of something so insignificant as the NYT. I don't think we should care that much about an AI's vast knowledge of the NYT either or why it matters.

If all these journalism companies don't want to provide the content for free they're perfectly capable of throwing the entire website behind a login screen. Twitter was doing it at one point. In a similar vein, I have no idea why newspapers are complaining about readership while also paywalling everything in sight. How exactly do they want or expect to be paid?

replies(1): >>015a+wP1

>>apante+G1
What's wrong with paying copyright holders, then? If OpenAI's models are so much more valuable than the sum of the individual inputs' values, why can't the company profit off that margin?

>That’s like a person having to pay a little bit of money to all of their teachers and mentors and everyone they’ve learned from every time they benefit from what they learned.

I could argue that public school teachers are paid by previous students. Not always the ones they taught, but still. But really, this is a very new facet of copyright law. It's a stretch to compare it with existing conventions, and really off to anthropomorphize LLMs by equating them to human students.

replies(1): >>apante+z01

>>vharuc+BT
> What's wrong with paying copyright holders, then?

There’s nothing wrong with it. But it would make it vastly more cumbersome to build training sets in the current environment.

If the law permits producers of content to easily add extra clauses to their content licenses that say “an LLM must pay us to train on this content”, you can bet that that practice would be near-universally adopted because everyone wants to be an owner. Almost all content would become AI-unfriendly. Almost every token of fresh training content would now potentially require negotiation, royalty contracts, legal due diligence, etc. It’s not like OpenAI gets their data from a few sources. We’re talking about millions of sources, trillions of tokens, from all over the internet — forums, blogs, random sites, repositories, outlets. If OpenAI were suddenly forced to do a business deal with every source of training data, I think that would frankly kill the whole thing, not just slow it down.

It would be like ordering Google to do a business deal with the webmaster of every site they index. Different business, but the scale of the dilemma is the same. These companies crawl the whole internet.

>>evanda+un
Most of the NYT is behind a signin screen; the classic "you can read the first paragraph of the page but pay us to see more" thing.

There is significant evidence (220,000 pages worth) in their lawsuit that ChatGPT was trained on text beyond that paywall.