The shady world of Brave selling copyrighted data for AI training

>>rand0m+(OP)
> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

> 2) The nature of the copyrighted work

> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole

> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.

Regardless, the four-factor test makes the question seem much less clear-cut than people here often say.

>>6gvONx+qs
Microsoft is gambling that model training will be ruled fair use. This makes that outcome seem unlikely.

>>flango+at
Do you think a human learning something from reading is fair use? Or are we all copyright violators because reading that article altered our connectomes, and we may recall parts of it later?

>>brooks+wu
The point being raised is quite specific. Not sure if you’re willfully ignoring it or what.

The answer is no, because you reading the article didn’t dramatically degrade its market value.

An AI ingesting all content on the internet and then being ultra-effective at frontrunning that content for a large number of future readers does degrade its market value (and subsumes it into the model’s value).

>>ethanb+rv
This applies to so many things, though.

The most obvious parallel to me is YouTube. There are a ton of people ingesting books, then transforming that information into a roughly paraphrased video for people to watch for free (ish). That devalues the books they read and paraphrased, because other people don't need to read them.

SparkNotes devalues actual books in a way, because a lot of high schoolers read the summaries instead of buying the actual book.

Search engines have also supplanted books in large part, because I don't need a whole book to answer a specific question. I don't know anyone that owns an encyclopedia anymore.

This is the next iteration of these processes. Non-novel information's market value has been degrading for decades now. A series of questions that would have cost thousands of dollars in books to answer in the '70s/'80s is now free, with or without AI.
