The shady world of Brave selling copyrighted data for AI training

>>rand0m+(OP)
How long until IP works its way onto ai training data or ais themselves? Ie that for some specific instance, the training is intentionally wrong, so as to check and prove that there has been a breach of IP.

replies(3): >>homele+Ae >>Deskto+Ye >>the847+Ej

>>rand0m+(OP)
I think this title is overstated. It seems like Brave is trying to do the right thing here vs other companies that don't even make the attempt. (Also, crawling as a service has been a thing for a while.)

replies(1): >>jsnell+yh

>>verisi+Zd
Do you mean like trap streets? Seems like a good idea for model makers

https://en.wikipedia.org/wiki/Trap_street

replies(1): >>lacrim+kg

>>rand0m+(OP)
I firmly believe that corps like these don't deserve the benefit of the doubt. Google, Brave and really anyone big enough to allow themselves to do bad things and get away with it must adhere to a standard where they proactively show their stuff doesn't have malicious intents.

replies(1): >>source+ef

>>verisi+Zd
While not intentionally wrong, Van Halen's brown M&M's rider comes to mind as an example of a similar measure.

>>isodev+Re
As always, if the product is free, you are the product...

replies(2): >>isykt+Xh >>woah+3i

>>homele+Ae
Yeah, something like that may be already happening and various actors building their cases as we speak.

>>throwa+ee
> It seems like Brave is trying to do the right thing here vs other companies that don't even make the attempt

I feel like I'm missing something. What the article claims they're doing is:

1. Misrepresenting what rights they have, and selling access to those rights.

2. Stealth-crawling the web, hiding from the webmasters just how much Brave is crawling their site, and making it impossible to block just their crawler.

How is either of these the right thing? I mean, for somebody besides Brave. What "attempt" are they making that other companies aren't?

replies(2): >>woah+ci >>throwa+Di

>>source+ef
“Although the saying tells us “If it’s free, then you are the product,” that is also incorrect. We are the sources of surveillance capitalism’s crucial surplus: the objects of a technologically advanced and increasingly inescapable raw-material-extraction operation. Surveillance capitalism’s actual customers are the enterprises that trade in its markets for future behavior.”

Excerpt From The Age of Surveillance Capitalism Shoshana Zuboff

>>source+ef
Did you read the article? This is about a paid web crawling api that the author thinks is too good or something. Nothing about a free product

>>jsnell+yh
Is there something wrong with accessing information that someone has posted for public access?

replies(2): >>JumpCr+mj >>theamk+gl

>>jsnell+yh
I think the first one seems to be a case where Brave just has incomplete information about licensing so for the Wikipedia data and other CCthey need to provide a link.

The second doesn't seem like a problem to me as long as they respect robots.txt

replies(2): >>jsnell+wk >>skille+9l

>>woah+ci
> Is there something wrong with accessing information that someone has posted for public access?

The Wikipedia example is glaring. They’re scraping content, stripping attribution and reselling it with a right to lock it down in a way that is not allowed by the original license.

Brave is laundering copyleft content while lying to their customers by selling a license they can’t give. If you’d like, you can sidestep the morality of copyright entirely and focus on the plagiarism and fraud.

>>rand0m+(OP)
From article:

> without any worry for copyright infringement because Brave acts as a middleman.

This isn’t how law works. Unless Brave is explicitly indemnifying all their customers (which their lawyers would have to be insane to let them do), any trouble you could get in, is going to be 100% your problem. Pointing the finger at Brave could theoretically get them in trouble too, but would in no way let you off the hook.

>>verisi+Zd
Depends, how do you distinguish humans acquiring knowledge by ingesting copyrighted content vs. a human using an AI that ingested copyrighted content?

replies(1): >>JumpCr+dl

>>throwa+Di
You didn't actually answer the question, at best you've sidestepped it by claiming that the dodgy shit is either by accident, or really not that bad. Maybe so.

But your original claim wasn't just "Brave are technically not doing anything illegal" or "they're no worse than the others". It was praising them for being better than the others, that they're the only ones trying to do the right thing. And for these example it's just not true, they're outright worse than the industry standard.

So, to repeat, what makes you think that "Brave is trying to do the right thing while other companies aren't even attempting"?

>>rand0m+(OP)
Why use brave if my info is already being leaked by third parties? E.g. experian. Is it worth the inconvenience and their repeated tricky attempts at monetizing their security conscious niche? Not being facetious, just a real question from a non security conscious person.

replies(2): >>asynch+tA >>soundn+OO

>>throwa+Di
I think you're missing the point. This is one example that uses a specific license, there are countless other licenses.

And you don't seem to have read the article either, because clearly it was explained that they don't respect robots.txt because they have no user-agent.

>>the847+Ej
> how do you distinguish humans acquiring knowledge by ingesting copyrighted content vs. a human using an AI that ingested copyrighted content

Doesn’t matter when the content is reproduced verbatim, as Brave is doing. If I memorise your content and then repeat it as my own, I’m not somehow off the hook for copyright violation and plagiarism.

>>woah+ci
Yes. Legalities aside, stripping attribution (author names) from contents which specifically requires keeping it, it a really shitty thing to do.

(The fact that they include original URL does not change much, given that they explicitly market it as "Data for AI" and those systems never have attribution)

>>rand0m+(OP)
> Simply observe the event in which a user does a query q in Brave and then, within one hour, does the same query on a different search engine. What we do is to move the script that detects bad-queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers.

Wait. Brave browser sends back to Brave Search engine about your browsing? Other search engines usage, but also crawl pages on your computer to help build their search index?

Ref: https://github.com/brave/web-discovery-project/blob/main/mod...

replies(4): >>jrmg+Vr >>w0ts0n+wC >>drusep+IL >>choppa+pn2

>>rand0m+(OP)
It's always surprising to me when I hear people using the brave browser... It's by a company that initially tried to replace their blocked ads with their own "safe and non-intrusive" ads as far as I remember, until they backpaddled because of the outrage.

It's also a for-profit company and you're not the customer, as you're not paying them money.

I'd be way more worried how they're using the data they're collecting on you vs Google or MS

replies(5): >>nicce+pp >>2Gkash+1t >>cempak+vt >>rglull+Cu >>siquic+KF1

>>411111+hn
People still like to defend Brave when it gets caught on shady things over and over again. I guess there are no too many other options. For some people it is already too difficult to install uBlock or know its existence.

replies(1): >>lalala+mH

>>hartat+am
If you don’t trust Brave then, yeah, they could be doing anything in the browser or on their servers - but that snippet you quoted is a slightly out of context statement from a big document about how they collect data like this, but _don’t_ collect or store it in a way that they could associate it with a user.

If you don’t trust that they’re doing what they say they are, then the document doesn’t mean anything. Although that would also mean the quote is kind of meaningless…

replies(2): >>hartat+8t >>mx20+9y

>>rand0m+(OP)
> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

> 2) The nature of the copyrighted work

> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole

> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.

Regardless, it makes it seem much less clear cut than people here often say.

replies(5): >>flango+at >>civili+Zt >>amluto+jv >>beware+Av >>_fbpp+dw

>>411111+hn
We have these cropping up like ants.

Mullvad

Brave

Opera

Vivaldi

Microsoft

Heck zoho is in on a browser now

What net gain does each of these companies provide over skinning chromium that isn't in Firefox?

Last time I asked brave fanboys why they don't redskin Firefox and the response was "Firefox is pita to build" all the while we have projects like palemoon and waterfox that are hobby projects. If they can work with firefox, so could someone else but no

replies(5): >>throwa+Nt >>rglull+Nv >>soundn+tJ >>Arclig+xo1 >>wallmo+UQy

>>jrmg+Vr
The rest of the document is worst. They say they are using your computer to crawl pages you visit and report back to their server. Even Google doesn't do that.

replies(2): >>w0ts0n+fD >>yjftsj+wR

>>6gvONx+qs
Microsoft is gambling on the hope that model training will be ruled fair use. This makes it seem that outcome is unlikely.

replies(1): >>brooks+wu

>>411111+hn
> I'd be way more worried how they're using the data they're collecting on you vs Google or MS

Why? They don't even have access to my emails and texts like those other companies do. I also don't see the names of their top executives and founders showing up in articles about connections to Jeffrey Epstein every few months.

>>2Gkash+1t
opera offers a free vpn and builtin adblocker

i would use it daily if the UI/UX was better, or more similar to firefox

replies(1): >>Timber+eT

>>6gvONx+qs
That’s not at all clear to me. IANAL but first of all it’s a balancing test, not a bright-line test. The judge could focus on any one factor and make an argument for either side quite easily.

Second, “use” here could mean one of two things: training or inference. It’s publishing the results of inference that can lead to actual effects on the market, not the training.

At the end of the day, someone has to prove tangible harm.

>>flango+at
Do you think a human learning something from reading is fair use? Or are we all copyright violators because reading that article altered our connectomes, and we may recall parts of it later?

replies(2): >>ethanb+rv >>snicke+yC1

>>411111+hn
You are a victim of the Mandella effect. There never was anything related to replacing ads in-page, yet if you ask all detractors what they don't like about it, that's the first point they bring up.

replies(4): >>impiss+VE >>mtlmtl+jG >>lalala+9H >>tokai+ZH

>>6gvONx+qs
I would look at #1 here. Crawling the Internet to collect information is one thing. (And people putting text on the web without requiring authentication seem to be granting at least some kind of license to anyone who sends a GET request.). But crawling the Internet (via centralized robots or users’ browsers), then storing that data and charging money to others for rights to that data (as Brave seems to be doing, quite explicitly) seems like it deserves a very different evaluation under factor #1.

>>brooks+wu
The point being raised is quite specific. Not sure if you’re willingly ignoring it or what?

The answer is no, because you reading the article didn’t dramatically degrade its market value.

An AI ingesting all content on the internet and then being ultra-effective at frontrunning that content for a large number of future readers does degrade its market value (and subsumes it into the model’s value).

replies(3): >>ivalm+0w >>cma+811 >>everfo+I71

>>6gvONx+qs
Unpopular opinion time:

A ML model is clearly a derivative work of its input.

Here's what I think would be fair:

Anyone who holds copyright in something used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to use accounting tricks to make profits disappear).

In the case of intermediaries (e.g., social media like reddit & twitter) those intermediaries could take a cut before passing it on to the original authors.

Obviously hellishly difficult to administer so it's unlikely to happen but I don't see a better answer.

replies(2): >>mattbe+Iz >>Dylan1+oX

>>2Gkash+1t
Have you worked in any project that required a forked browser?

I did. When we folded less than two years later, one of the CTOs biggest stated regrets was that he went with Firefox instead of Chromium. The extension story in Firefox was easily 10x harder. Interfacing with the OS as well. Getting dbus services to work was a fool's errand.

replies(1): >>2Gkash+1x

>>ethanb+rv
I disagree. People learning how to draw does degrade the future value of copyrighted work. Imagine the future where nobody was allowed to learn to draw, existing copyright value would skyrocket!

replies(1): >>llamai+XA

>>6gvONx+qs
The entire fair use claim is derived not from any legal basis, but rather, that "it has to be fair use" because it would be legally catastrophic for OpenAI et al if it weren't true.

If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.

For Github copilot and ChatGPT examples of it reciting large sections of training data are well known. Plenty can be found on HN. It doesn't generate a new valid windows serial key on the fly, it's memorized them.

If one wants to be cynical, it's not hard to see OpenAI/etc patching in filters to remove copyrighted content from the output precisely because it's legally catastrophic for their "fair use" claim to have the model spit out copyrighted content. As this is both copyright infringement by itself, and evidence that no matter how the internals of these models work, they store some of the training data anyway.

replies(2): >>twoodf+9F >>kmeist+rX

>>rglull+Nv
Cool so your company folded but as I said, palemoon and waterfox seem to be running just fine.

Thunderbird also works.

I happen to own a brwoser extension and have both chromium and Firefox extensions. I kinda know myself.

replies(1): >>rglull+zH

>>rand0m+(OP)
Unpopular opinion: the next iteration of privacy laws needs to factor in AI. If AI is allowed to slurp up PII or derogative works and the people defending it defend it with the zeal of cryptobros then we're in for a decade of real pain in terms of both copyright law, PII, and IP exposure.

replies(2): >>asynch+jA >>_fbpp+7B

>>jrmg+Vr
How do they detect if someone poisons their data, if they not at least associate IP addresses to the data?

>>beware+Av
I don't know what a fair settlement would be but I'm looking forward to a copyright-holder suing OpenAI to obtain one. These companies have no value if copyright can be enforced on their training data.

replies(1): >>visarg+oQ

>>kodah+px
AI is going to do that irregardless- the debate is essentially going to revolve around how and what people can make new commercial works from that data.

>>lopati+Ak
It’s built in Ad blocker and other features are heads and tails above anything else I’ve used before, personally.

>>rand0m+(OP)
Brave continues to be shady. They claim to respect robots.txt but don't identify their crawler if you want to block it.

> They don't mention their crawler anywhere in their docs, either. So, if you wanted to block Brave from crawling and indexing and ultimately selling your content to third parties, your only option for the time being would be to block all crawlers, which is how Brave would be able to "respect robots.txt".

>>ivalm+0w
Arguments like this are great for getting your side to go "rah rah got 'em" and really, really bad for convincing anyone else.

Legal judgments generally focus on actual impacts rather than quirks that might exist in hypothetical universes.

replies(1): >>tharku+sE

>>kodah+px
The fun part is that the GDPR already does. The answer is you're not allowed to use personal data for AI. (And "personal data" here covers things like all public social media posts)

Facebook recently got told by the CJEU that, no, they can't use people's posts to target advertisements. Even if those ads are what's paying for the platform. That you can't claim such processing as "part of the contract" unless it is absolutely necessary in the same way the post office needs an address to send a parcel.

If Facebook can't even do that, there is no way LLMs will be allowed. (And remember. The GDPR does not care if your system doesn't distribute personal data. Any kind of processing at all falls under the GDPR's requirements)

OpenAI is already being chased by the EU's privacy agencies. Right now they're in the process of asking pointed questions, things will heat up after that.

replies(1): >>BeFlat+l71

>>rand0m+(OP)
The websites a Brave user browses are anonymously relayed to their servers for indexing/training. So, they crawl the web without a crawler and the website operators can't do anything about it.

That's genius!

replies(1): >>jonath+n31

>>rand0m+(OP)
This discussion on fair use are always quite anglocentric.

Atricle 3 and 4 of the EU 'Copyright in the Digital Single Market' give data miners quite extensive rights.

Move operation to the EU, train a foundational model, than train a constitutional model based on that.

As much as I hate the upcoming AI regulation, the CDSM is solid.

https://academic.oup.com/grurint/article/71/8/685/6650009 https://eur-lex.europa.eu/eli/dir/2019/790/oj

Update: Fixed wrong link

replies(1): >>pedroc+OI

>>hartat+am
This is (importantly) opt-in.

"Brave doesn’t follow the sneaky practices of other big tech search engines. The Web Discovery Project is opt-in, and the data collected under the Web Discovery Project has specific protections to ensure anonymity." per https://support.brave.com/hc/en-us/articles/4409406835469-Wh...

replies(2): >>fredol+Qz1 >>choppa+9o2

>>hartat+8t
This is opt-in only. https://support.brave.com/hc/en-us/articles/4409406835469-Wh...

replies(1): >>DonHop+Bg2

>>llamai+XA
While that may be what your parent intended I'm not entirely sure and there does exist the philosophical level discussion here. Or market economics level I guess.

If your pool of people that can learn about topic X is restricted the outputs or their labor are more expensive. Now lift a continent of billions of people out of poverty, get them access to schooling, safety etc and see the market forces do the rest.

Now equate ChatGPT et al with said billion people. Just that it runs on electricity. If quality is good enough of course. Which is hard to decide right now because of hype.

replies(1): >>ivalm+1G1

>>rglull+Cu
I think some initial news articles claimed this and everyone went with it. Which is basically how it worked, replacing ads, but in a different way, and actually more annoying.. block everyone else's ads...and have their own little popup ads, and if you enabled that, you'd get paid in BAT tokens per view too.

>>_fbpp+dw
It actually doesn’t even matter if LLMs reproduce copyrighted data from their training. The issue is that a human copied the data from its source into memory for use in training, and this copy was likely not fair use under cases like MAI Systems.

The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against Andy Warhol’s estate for his copying of photographs of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.

I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.

https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...

replies(2): >>gyudin+eS >>luma+i61

>>rglull+Cu
As a detractor and therefore a part of the set "all detractors", I do not believe this. I just don't buy their shady marketing and try not to support engine monoculture.

>>rglull+Cu
They don't replace ads in-page, but they do something very very similar.

They block the in-page ads and instead provide their own ads through popup notifications.

So they are replacing advertisements on websites.

replies(1): >>rglull+xQ

>>nicce+pp
It's because a lot of people are bought into BAT (Brave's cryptocurrency) and have a strong financial incentive to shill Brave.

replies(3): >>tokai+bI >>DaSHac+gX >>boondo+eZ

>>2Gkash+1x
> palemoon and waterfox seem to be running just fine.

GNU/Hurd is also a very interesting alternative OS, the design is a lot more elegant than GNU/Linux, it's still under active development and it has a surprising number of active users.

It's still a very bad idea to build the foundation of your tech stack on it.

>>rglull+Cu
Seems you're wrong. Here's an archive link to Braves page in '16 were their planned add replacement is explained:

https://archive.md/W0k4j

replies(1): >>lalala+IK

>>lalala+mH
Ah that makes sense why brave fans post a bit like cryptobros.

>>nieman+SB
It's not clear that "data mining" covers this use. These models are huge, big enough that they can just contain direct copies of copyrighted works. They've been shown to reproduce them relatively easily. The argument is that they've actually generalized enough or learned enough that they're now no longer the sum of the dataset. I can definitely see that being possible but the way the technology works it's really hard to know if that has happened or if what's happening instead is a bunch of copyright washing.

There are some things that would make for good faith displays by the players in the space. For example, Microsoft has been investing a lot and yet their code offering is not trained on their internal code base. Same for Google. Start by doing that and I'll entertain the argument that your tools are fair use or data mining.

replies(1): >>nieman+RM

>>2Gkash+1t
Brave literally started on Gecko.

replies(1): >>2Gkash+PJ

>>soundn+tJ
Is brave currently forking chromium or not?

By your logic, opera was having their own engine till 2013. So what?

>>tokai+ZH
One minor note: they made a slight change in plans from that initial design.

How it works now is that when Brave replaces an ad, they put the new ad in a popup, not in-page

replies(2): >>rglull+QV >>Dylan1+vW

>>hartat+am
This specific feature is already opt-in, but historically the answer has always been "yes" for dozens of 'features' like this that fly under the radar until users start complaining, and then eventually get converted to opt-in or removed in order to save face.

>>pedroc+OI
My reading of the relevant laws would actually lead me to believe that this is not a problem, as long as those reproductions are not returned and the eights holder did not opt out. But courts might decide differently.

Regarding the copyright of returned material here is a good discussion:

https://copyrightblog.kluweriplaw.com/2023/05/09/generative-...

replies(2): >>JumpCr+jS >>pedroc+7a1

>>lopati+Ak
You get degoogled Chromium with e2ee bookmarks etc. sync and a lot of nice convenience features like vertical tabs and mobile background video playback.

And if it's your cup of tea, they let you straight up pay money for the search engine.

>>mattbe+Iz
I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, or summarising, or turning it into question-answer pairs. In this new format it can serve as training data for a clean LLM. Of course the public domain data would be used directly, no need to go synthetic there.

An important direction would be to train copyright attribution models, and diff-models to detect when a work is infringing on another, by direct comparison. They would be useful to filter both the training set and the model outputs.

replies(1): >>mattbe+h61

>>lalala+9H
Every time this comes up, the argument is the same. People always forget:

- The ad blocker works separately from their own ad service.

- Their own ads are opt-in.

- People receive 70% of the revenue from the ads they see.

- The ads from Brave do not track you and whatever personalisation happens in-device, no data is mined.

So, no. They are not "replacing" anything. They are not stealing anyone's revenue (and no matter how much Linus from LTT argues, he is not entitled to any revenue just because I watched any of his videos) and Brave's own ads are from deals that they closed themselves and a essentially fraud-proof compared with whatever payouts are given by largest ad networks.

In other words, they are just offering something that happens to be infinitely more user-focused than the status quo. Every attempt at framing this as unethical came from an uninformed or biased source.

>>hartat+8t
> Even Google doesn't do that

At least Bing did, though. >>2169793

>>twoodf+9F
So how is that supposed to work with people sending it legally obtained copyrighted materials for an analyze?

replies(1): >>twoodf+011

>>nieman+RM
> as long as those reproductions are not returned

That’s the author’s entire gripe. Brave reproduced a Wikipedia entry without attribution and then slapped a copyright on it to boot.

>>throwa+Nt
If am not wrong, Opera is owned by some Chinese company and they are known for doing some really shady stuff [0][1] in African countries.

[0] https://blogs.opera.com/africa/2022/05/free-data-with-opera-...

[1] https://www.androidpolice.com/2020/01/21/opera-predatory-loa...

replies(1): >>KORraN+3X

>>lalala+IK
Correct. I've been using Brave since their very first versions on the desktop, and there never was any in-page ad insertion.

The one type of in-page modification they used to do is that they would add a "tip" button to the content creator of some social networks like Twitter or reddit. That had nothing to do with "replacing ads" though.

> replaces an ad, they put the new ad in a popup

Incorrect. There is no 1:1 replacement. You as the user can define how often you want to receive notifications, and even then the notifications only come when you are switching context between any action. It won't interrupt you while you are watching a video, working on google doc spreadsheet or reading though HN.

replies(1): >>411111+TW

>>lalala+IK
"replaces an ad, they put the new ad"

That's not how it works. If you turn on Brave ads, they show up every once in a while, completely independently of webpage ads. And they work whether your ad blocker is on or off.

>>rglull+QV
You're responding to a comment that gave you a link to their inital plan, which was literally replacing the ads.

click on it, your horizon might be broadened by the added knowledge.

replies(1): >>Dylan1+RX

>>Timber+eT
True, right now lots of folks from the original Opera team (including CEO) work on Vivaldi. If one day Mozilla forces me to ditch Firefox, I will probably switch to this browser.

>>lalala+mH
BAT gives like no money, especially after the crypto crash. Its far more likely its just the browser wars of old, but with even less options to choose from people are going to be more adamant their choice is the best.

>>beware+Av
> A ML model is clearly a derivative work of its input.

Do you mean this in a copying sense or a mathematical sense?

What if it's only storing 1 byte per input document?

>>_fbpp+dw
OpenAI's bias research on DALL-E revealed that most examples of regurgitation come from repeated copies of the same image in the training set. When they filtered out duplicates, DALL-E stopped drawing training examples.

The problem is that filtering the training set is naively O(n^2) and n is already extremely large for DALL-E. For LLMs, it's comically huge, plus now you have to do substring search. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.

As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.

My money's still on "training is fair use", but that actually doesn't help OpenAI all that much either, because fair use is not transitive. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model is fair use. Novel outputs aren't entirely safe, though. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

This logic would also apply in the EU. Last I checked the TDM exception only said training is legal, not that you could sell the outputs. They don't really respect jurisprudence the way the Anglosphere obsesses over "precedent", so copyright exceptions are almost always decided by legislatures and not judges over there, and the likelihood of a judge saying that all outputs are derivative works of the training set regardless of regurgitation is higher.

[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.

Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.

[1] If the opinion was worded poorly this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.

[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.

replies(2): >>richk4+o71 >>6gvONx+Kt1

>>411111+TW
That was their original plan, yes.

lalaland1125 is making claims about what they actually did, and those claims are not correct.

>>lalala+mH
I do not use BAT or any crypto. Brave just works, and it blocks ads automatically when I tell friends to install it on their computers.

I used to recommend Firefox, but Mozilla has totally jumped the shark (privacy violations [multiple], wastes too much money, blocks APIs that are useful with no real security risks while approving APIs with little use that do have security risks, etc, very user hostile).

Chromium is obviously not trustworthy at this point, let alone Chrome. So that leaves like, Safari and Opera?

Brendan Eich is the CEO of Brave, and I trust him. Mozilla was good until he was ousted for political reasons.

replies(1): >>nicce+g41

>>gyudin+eS
That copy (the “send”) would be evaluated under the same fair use criteria.

“Write a review of this short story: …” – probably fine.

“Rewrite this short story to have a happier ending: …” – probably not.

>>ethanb+rv
> The answer is no, because you reading the article didn’t dramatically degrade its market value.

How about if you read a news article to write a competing one rewording and possibly citing it (one of the most common practices in news)?

replies(1): >>ethanb+rl1

>>k__+DB
I'm Sampson, from the Brave team. The Web Discovery Project is a clever approach. For Brave to compete with Google, and offer a truly novel index of the Web, a novel approach must be taken. The WDP is an opt-in, privacy-preserving approach which gives Brave a fighting chance against the Search incumbants. Due to our preference of "Can't be evil" over "Don't be evil," the WDP is not only designed with privacy and anonymity as a prerequisite, but it is also open-source for public scrutiny and evaluation: https://github.com/brave/web-discovery-project.

replies(2): >>k__+hv1 >>yamsam+5z1

>>boondo+eZ
> Chromium is obviously not trustworthy at this point, let alone Chrome. So that leaves like, Safari and Opera

Brave is like 99% of Chromium + uBlock…

replies(1): >>boondo+Z91

>>visarg+oQ
Would automated paraphrasing not be a derivative work of the original?

replies(1): >>visarg+4w2

>>twoodf+9F
How does that align with Google Books scanning libraries full of copyrighted text, offering full reproductions of sections of the work, and then having the supreme court declare it all to be Fair Use? I think that is a far more relevant precedent here: https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....

replies(2): >>twoodf+j71 >>6gvONx+xt1

>>luma+i61
The Supreme Court declined to hear the case on appeal, which is a shade different from endorsing the decision after a hearing.

That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.

replies(1): >>luma+ip1

>>_fbpp+7B
End result: EU AI enjoyers use a VPN plus a US-based credit card borrowed from a friend.

>>kmeist+rX
> A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

A judge can’t “commit” the butlierian jihad. A jihad is a mass event caused by some fraction of the population believing in some cause.

Which kinda gets to a point that seems to be missed. Copyright law is not “intrinsic” - nobody thinks that copyright is a natural law - it is just a pragmatic implementation which balances various public and private goods. If the world changes such that the law no longer does a good job of balancing the various goods, then either the law will get changed or people will ignore the law.

replies(1): >>kmeist+Lg1

>>ethanb+rv
This applies to so many things, though.

The most obvious parallel to me is YouTube. There are a ton of people ingesting books, then transforming that information into a roughly paraphrased video for people to watch for free (ish). That devalues the books they read and paraphrased, because other people don't need to read them.

Spark Notes devalue actual books in a way, because a lot of high schoolers read those instead of buying the actual book.

Search engines have also supplanted books in large part, because I don't need a whole book to answer a specific question. I don't know anyone that owns an encyclopedia anymore.

This is the next iteration of these processes. Non-novel information's market value has been degrading for decades now. A series of questions that would have cost thousands of dollars in books to answer in the 70's/80's is now free, with or without AI.

replies(1): >>llamai+jl1

>>nicce+g41
Right, I should have said "the main Chromium branch is obviously not trustworthy". It is possible to remove the untrustworthy bits, however, and there are a variety of de-googled Chromium builds.

Chromium is a great browser, unfortunately the official branch has been poisoned by Google.

>>nieman+RM
That's clearly not enough. There's a continuum between producing exact input copy and having genuine creativity because the model actually learned something. A model that just reformats code and changes all the variable names would pass your test and yet be clearly a copyright violation. This whole argument requires that the neural network weights do something creative because they learned from the code instead of just transforming it. We're even careful about this with humans with things like clean room reimplementations to make sure.

replies(1): >>hakfoo+kM1

>>richk4+o71
Copyright is a unique case in which the law represents a bargain struck in the 1970s that hasn't been updated since. Everyone ignores it because it's nearly impossible to actually enforce copyright on individual infringers. But that doesn't mean copyright is meaningless: any activity which is large enough to be legible[0] to the state will be forced to bend itself to fit within the copyright bargain.

And AI training is extremely legible. This is not like a bunch of people downloading stuff off BitTorrent. All of the large foundation models we use were trained by a large corporation with a source of venture capital funding which could be easily shut off by a sufficiently motivated government. Weights-available and liberally licensed models exist, but most improvements on them are fine-tuning. Anonymous individuals can fine-tune an LLM or art generator with a small amount of data and compute, but they cannot make meaningful improvements on the state of the art.

So our sufficiently motivated copyright judge could at least effectively freeze AI art in time until Big Tech and the MAFIAA agree on how to properly split the proceeds from screwing over individual artists.

"Butlerian Jihad" is a term from a book, so you don't need to take "jihad" literally. However, I will point out that there is a significant fraction of the population that does want to see AI permanently banned from creative endeavors. The loss of ownership over their work from having it be in the training set is a factor, but their main argument is that they specifically want to keep their current jobs as they are. They do not want to be replaced with AI, nor do they want to replace their existing drawing work with SEO keyword stuffed text-to-image prompts.

[0] https://en.wikipedia.org/wiki/Seeing_Like_a_State

replies(1): >>richk4+L52

>>everfo+I71
LLMs are attracting so much positive attention because they are likely to be a huge, huge step change improvement than all those methods you mention.

For that same exact reason, it’s totally reasonable they’re attracting unique amounts of negative attention too.

You can’t have it both ways: yes LLMs are going to change information retrieval the way nothing else has before, but no it’s actually just like all the other things in terms of their impact on incentive structures.

FWIW I don’t really know where I land on this issue. I just find it totally incoherent to believe in the bull case of “this will transform everything” while also portraying it all as par for the course when discussing potential negatives.

Just because Spark Notes didn’t obviously manage to kill valuable parts of our information ecosystem and economy does not mean that Spark Notes x 10,000,000 will not.

>>cma+811
How about it? Do you not think it incurs a lot of negative effects?

replies(1): >>cma+2F7

>>2Gkash+1t
Mullvad is actually a Firefox fork and it directly uses Tor's privacy enhancements[0] to Firefox for a private web browsing experience. As a matter of fact, it really looks like Tor Browser but with a VPN baked in instead of Tor.

[0] https://mullvad.net/en/browser

>>twoodf+j71
OpenAI is also providing access to limited extracts. Google wasn't selling this over an API, they were providing "free" access to it while displaying ads to the user. Would the courts see this manner of monetization to be different enough that settled case law wouldn't apply?

replies(1): >>twoodf+RB1

>>luma+i61
Google also bought copies of each book, I believe, which makes it another step removed from standard ML practice.

>>kmeist+rX
> The problem is that filtering the training set is naively O(n^2)

There are standard ways to do it that are O(n), FYI.

>>jonath+n31
.es?

>>rand0m+(OP)
My entirely biased opinion is https://www.mojeek.com/ - a traditional search engine crawler (as in, follow links on the web) that identifies its user agent. Dead Simple. The open web, you can search on it.

>>jonath+n31
It's not a clever approach, it's basically scraping Google results because that's where your users are searching. You follow the bread crumbs from Google searches.

Cliqz entire history was based on this kind of thing, milking off other search engines by just deducting their ranking methods, it's parasitic. There's no cleverness about it.

replies(1): >>throwa+DB1

>>w0ts0n+wC
I think mentioning your affiliation with Brave might go a long way in contextualizing why you are defending/rationalizing this (even if opt-in).

Editing to add that I don't mean to imply ill will on your part, but that I think being affiliated with Brave might have you taking this type of practice a little more lightly than it probably should be taken.

>>yamsam+5z1
I don't know a lot about this particular approach but your comment that it's just using Google results is blatantly false. It all depends on the search engine that the brave user is leveraging, or no search engine if they type in the URL directly into the header.

replies(1): >>yamsam+FK1

>>luma+ip1
OpenAI isn’t doing anything like what Google was doing with Books. It’s not hard for laymen to see that, and it’s going to be obvious to any judge who hears a case.

Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously be creating a derivative work and outside fair use bounds. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.

Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.

>>brooks+wu
Yes it is considered fair use but it's also completely irrelevant because we're talking about a computer program not a person.

>>411111+hn
There are multiple ways you can pay Brave.

https://brave.com/firewall-vpn/ https://account.brave.com/?intent=checkout&product=search https://brave.com/search/api/

>>tharku+sE
Your sentiment is exactly what I intended, albeit I was terse and a little facetious. ChatGPT is like introducing a bunch of new skilled labor, it’s just for the first time this skilled labor isn’t human. The fact that this skilled labor learned from copyrighted material is like saying human labor learned from copyrighted material.

>>throwa+DB1
Nonsense. This naive idea that Brave innocuously looks at user's traffic patterns.

Google owns 95% of the market in most Western markets. There's no "blatantly false" about that.

They scrape search engine results and present them as their own.

Do 10,000 searches on Google and Brave and you'll see how similar they are. It's as simple as that, scraping by sleight of hand.

Why can't they be a normal search engine - because they need to scrape others. Simples.

replies(2): >>k__+eu3 >>spider+Gc5

>>pedroc+7a1
The window of possible actual creativity may be limited and variable.

There are a lot of pretty complex prompts, where if you asked a group of reasonably skilled programmers to implement, they'd produce code that was "reformatted and changed variable names" but otherwise identical. Many of us learned from the same foundational materials, and there are only a handful of non-pathological ways to implement a linked list of integers, for example.

With code it may be more obvious, in that you can't as easily obfuscate things with synonyms and sentence structure changes. Even with prose, there is going to be a tendency to use "conventional" language choices, driving you back towards a familiar-looking mean.

>>kmeist+Lg1
Butlerian jihad is a good reference point. Something so bad happened that a large enough portion of the population was convinced to destroy thinking machines, and this no-computer norm was held in human society for a crazy long time (been too long for me to remember how long elapsed before Chapterhouse, which I think is the book where thinking machines start returning). It was a core belief of humanity that computers were bad, not a law imposed by a judge or legislature.

So say a US judge did impose severe restrictions on LLMs through US copyright law. The giant companies that are using LLMs will just move to another country. And just like tax law, others will be happy to have them. Would the US start blocking inbound internet traffic from countries that don’t have the same interpretation of copyright? That seems very unlikely.

The point is that the only way LLMs get the butlerian jihad treatment is if the people rise up against them. Right now, that is nowhere close to happening.

>>w0ts0n+fD
If you're shilling for Brave, you should reveal you affiliation with them.

What is it?

>>hartat+am
And Google gets the same data joining your cookies ever since Google Plus unified auth across their properties a decade ago. Wait you mean you thought G+ was supposed to compete against Facebook-the-product and not just Facebook-the-ad-network? Oops

Brave is perfectly OK with having oopsies too

>>w0ts0n+wC
Opt-in or not, “sneaky” is a marketing term and not a UX principle. E.g. showing the user an example of real ROI from providing their data.

That said, stuff like Jedi Blue and Project Bernanke suggest Brave could just disclose they support competitive markets.

>>mattbe+h61
So you think any paraphrase of a copyrighted phrase is in copyright violation? That's like owning the idea itself. Is any utterance similar to this one now forbidden?

replies(1): >>mattbe+XO3

>>yamsam+FK1
I would expect most people today find interesting content via social media, and not search engines.

>>visarg+4w2
I think if you automate paraphrasing from an original work to use that original work on bulk somehow, yes.

How do you even automate paraphrasing without training it on lots of original work? It's infringement all the way down.

>>yamsam+FK1
I had a discussion about this on Twitter with Brendan Eich. He became hostile very quickly. He is not a very nice person.

>>ethanb+rl1
What's the alternative, each news event is first to publish exclusivity regardless of quality? No synthesizing multiple stories into a linked narrative?

>>2Gkash+1t
Uh. "cropping up like ants" is definitely a take. Not a good one, as most the browsers here had their first release date of 199*. I will list them out.

Mullvad, is the Tor Browser with the Mullvad VPN included, and released 2023. However, the Tor Browser, which it effectively is, is from 2002.

Brave, the one in this article, is from 2019.

Opera is from 1994.

Vivaldi is from 2015, and is developed by Opera's previous dev-team after a bad sale to a Chinese company.

Microsoft's first browser, Internet Explorer, is from 1995.

I can not comment about Zoho's browser, as i know little about it.

zlacker

The shady world of Brave selling copyrighted data for AI training