zlacker

The shady world of Brave selling copyrighted data for AI training

submitted by rand0m+(OP) on 2023-07-15 11:59:25 | 261 points 123 comments

NOTE: showing posts with links only
◧◩
3. homele+Ae[view] [source] [discussion] 2023-07-15 13:44:23
>>verisi+Zd
Do you mean like trap streets? Seems like a good idea for model makers

https://en.wikipedia.org/wiki/Trap_street

21. hartat+am[view] [source] 2023-07-15 14:34:16
>>rand0m+(OP)
> Simply observe the event in which a user does a query q in Brave and then, within one hour, does the same query on a different search engine. What we do is to move the script that detects bad-queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers.

Wait. The Brave browser sends data about your browsing back to the Brave Search engine? Not just your usage of other search engines, but it also crawls pages on your computer to help build their search index?

Ref: https://github.com/brave/web-discovery-project/blob/main/mod...
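
For anyone trying to picture what the quoted check amounts to, here is a minimal sketch of the "same query on a different search engine within one hour" condition. It is not taken from the linked repository: `SearchEvent`, `normalize`, `repeated_elsewhere` and the engine labels are hypothetical names, and the other conditions the quote alludes to ("when all conditions are met") are not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # "within one hour", per the quoted text


@dataclass(frozen=True)
class SearchEvent:
    engine: str   # hypothetical label, e.g. "brave" or "other"
    query: str
    at: datetime


def normalize(query: str) -> str:
    # Assumption: queries match if equal after lowercasing and collapsing whitespace.
    return " ".join(query.lower().split())


def repeated_elsewhere(events, home="brave"):
    """Return normalized queries run on `home` and then repeated on a
    different engine no more than WINDOW later."""
    last_home = {}  # normalized query -> time of the most recent home search
    hits = []
    for ev in sorted(events, key=lambda e: e.at):
        key = normalize(ev.query)
        if ev.engine == home:
            last_home[key] = ev.at
        elif key in last_home and ev.at - last_home[key] <= WINDOW:
            hits.append(key)
    return hits
```

The point of the sketch is only that the comparison can run locally over the user's own query history, which is what "move the script ... to the browser" implies; what is then sent back is a separate question.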

50. nieman+SB[view] [source] 2023-07-15 16:07:14
>>rand0m+(OP)
These discussions of fair use are always quite anglocentric.

Articles 3 and 4 of the EU 'Copyright in the Digital Single Market' directive give data miners quite extensive rights.

Move operations to the EU, train a foundational model, then train a constitutional model based on that.

As much as I hate the upcoming AI regulation, the CDSM is solid.

https://academic.oup.com/grurint/article/71/8/685/6650009

https://eur-lex.europa.eu/eli/dir/2019/790/oj

Update: Fixed wrong link

◧◩
51. w0ts0n+wC[view] [source] [discussion] 2023-07-15 16:11:13
>>hartat+am
This is (importantly) opt-in.

"Brave doesn’t follow the sneaky practices of other big tech search engines. The Web Discovery Project is opt-in, and the data collected under the Web Discovery Project has specific protections to ensure anonymity." per https://support.brave.com/hc/en-us/articles/4409406835469-Wh...

◧◩◪◨
52. w0ts0n+fD[view] [source] [discussion] 2023-07-15 16:14:14
>>hartat+8t
This is opt-in only. https://support.brave.com/hc/en-us/articles/4409406835469-Wh...
◧◩◪
55. twoodf+9F[view] [source] [discussion] 2023-07-15 16:22:48
>>_fbpp+dw
It actually doesn’t even matter if LLMs reproduce copyrighted data from their training. The issue is that a human copied the data from its source into memory for use in training, and this copy was likely not fair use under cases like MAI Systems.

The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against Andy Warhol’s estate for his copying of photographs of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.

I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.

https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...

◧◩◪
60. tokai+ZH[view] [source] [discussion] 2023-07-15 16:35:43
>>rglull+Cu
Seems you're wrong. Here's an archive link to Brave's page from '16 where their planned ad replacement is explained:

https://archive.md/W0k4j

◧◩◪
67. nieman+RM[view] [source] [discussion] 2023-07-15 17:00:52
>>pedroc+OI
My reading of the relevant laws would actually lead me to believe that this is not a problem, as long as those reproductions are not returned and the rights holder did not opt out. But courts might decide differently.

Regarding the copyright of returned material, here is a good discussion:

https://copyrightblog.kluweriplaw.com/2023/05/09/generative-...

◧◩◪◨
71. yjftsj+wR[view] [source] [discussion] 2023-07-15 17:26:54
>>hartat+8t
> Even Google doesn't do that

At least Bing did, though. >>2169793

◧◩◪◨
74. Timber+eT[view] [source] [discussion] 2023-07-15 17:36:46
>>throwa+Nt
If I am not wrong, Opera is owned by some Chinese company and they are known for doing some really shady stuff [0][1] in African countries.

[0] https://blogs.opera.com/africa/2022/05/free-data-with-opera-...

[1] https://www.androidpolice.com/2020/01/21/opera-predatory-loa...

◧◩
86. jonath+n31[view] [source] [discussion] 2023-07-15 18:54:26
>>k__+DB
I'm Sampson, from the Brave team. The Web Discovery Project is a clever approach. For Brave to compete with Google, and offer a truly novel index of the Web, a novel approach must be taken. The WDP is an opt-in, privacy-preserving approach which gives Brave a fighting chance against the Search incumbents. Due to our preference of "Can't be evil" over "Don't be evil," the WDP is not only designed with privacy and anonymity as a prerequisite, but it is also open-source for public scrutiny and evaluation: https://github.com/brave/web-discovery-project.
◧◩◪◨
89. luma+i61[view] [source] [discussion] 2023-07-15 19:14:19
>>twoodf+9F
How does that align with Google Books scanning libraries full of copyrighted text, offering full reproductions of sections of the work, and then having the courts declare it all to be fair use (the Second Circuit ruled for Google, and the Supreme Court declined to hear the appeal)? I think that is a far more relevant precedent here: https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....
◧◩◪◨⬒
96. kmeist+Lg1[view] [source] [discussion] 2023-07-15 20:31:10
>>richk4+o71
Copyright is a unique case in which the law represents a bargain struck in the 1970s that hasn't been updated since. Everyone ignores it because it's nearly impossible to actually enforce copyright on individual infringers. But that doesn't mean copyright is meaningless: any activity which is large enough to be legible[0] to the state will be forced to bend itself to fit within the copyright bargain.

And AI training is extremely legible. This is not like a bunch of people downloading stuff off BitTorrent. All of the large foundation models we use were trained by a large corporation with a source of venture capital funding which could be easily shut off by a sufficiently motivated government. Weights-available and liberally licensed models exist, but most improvements on them are fine-tuning. Anonymous individuals can fine-tune an LLM or art generator with a small amount of data and compute, but they cannot make meaningful improvements on the state of the art.

So our sufficiently motivated copyright judge could at least effectively freeze AI art in time until Big Tech and the MAFIAA agree on how to properly split the proceeds from screwing over individual artists.

"Butlerian Jihad" is a term from a book, so you don't need to take "jihad" literally. However, I will point out that there is a significant fraction of the population that does want to see AI permanently banned from creative endeavors. The loss of ownership over their work from having it be in the training set is a factor, but their main argument is that they specifically want to keep their current jobs as they are. They do not want to be replaced with AI, nor do they want to replace their existing drawing work with SEO keyword stuffed text-to-image prompts.

[0] https://en.wikipedia.org/wiki/Seeing_Like_a_State

◧◩◪
99. Arclig+xo1[view] [source] [discussion] 2023-07-15 21:31:54
>>2Gkash+1t
Mullvad is actually a Firefox fork and it directly uses Tor's privacy enhancements[0] to Firefox for a private web browsing experience. As a matter of fact, it really looks like Tor Browser but with a VPN baked in instead of Tor.

[0] https://mullvad.net/en/browser

104. ricard+xx1[view] [source] 2023-07-15 23:01:14
>>rand0m+(OP)
My entirely biased opinion is https://www.mojeek.com/ - a traditional search engine crawler (as in, follow links on the web) that identifies its user agent. Dead Simple. The open web, you can search on it.
◧◩
110. siquic+KF1[view] [source] [discussion] 2023-07-16 00:26:14
>>411111+hn
There are multiple ways you can pay Brave.

https://brave.com/firewall-vpn/

https://account.brave.com/?intent=checkout&product=search

https://brave.com/search/api/
