The shady world of Brave selling copyrighted data for AI training

>>rand0m+(OP)
These discussions of fair use are always quite anglocentric.

Articles 3 and 4 of the EU 'Copyright in the Digital Single Market' (CDSM) directive give data miners quite extensive rights.

Move operations to the EU, train a foundational model, then train a constitutional model based on that.

As much as I hate the upcoming AI regulation, the CDSM is solid.

https://academic.oup.com/grurint/article/71/8/685/6650009 https://eur-lex.europa.eu/eli/dir/2019/790/oj

Update: Fixed wrong link

>>nieman+SB
It's not clear that "data mining" covers this use. These models are huge, big enough that they can contain direct copies of copyrighted works, and they've been shown to reproduce them relatively easily. The argument is that they've generalized or learned enough that they're no longer just the sum of the dataset. I can definitely see that being possible, but given the way the technology works, it's really hard to know whether that has happened or whether what's happening instead is a bunch of copyright washing.

There are some things that would make for good-faith displays by the players in the space. For example, Microsoft has been investing a lot, and yet their code offering is not trained on their internal code base. Same for Google. Start by doing that and I'll entertain the argument that your tools are fair use or data mining.
