zlacker

Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

submitted by pyman+(OP) on 2025-07-07 09:20:38 | 497 points 603 comments
[view article] [source]

NOTE: showing posts with links only.
23. marapu+ib[view] [source] 2025-07-07 11:17:22
>>pyman+(OP)
Apparently it's a common business practice. Spotify (even though I can't find any proof) seems to have built its software and business on pirated music. There is more in this article [0].

https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...

Funky quote:

> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.

◧◩◪◨
24. drcurs+qb[view] [source] [discussion] 2025-07-07 11:17:42
>>pyman+G4
Let's not forget Spotify ;)

https://gizmodo.com/early-spotify-was-built-on-pirated-mp3-f...

◧◩
65. techja+Zq[view] [source] [discussion] 2025-07-07 13:07:26
>>marapu+ib
Crunchyroll was originally an anime piracy site that went legit and started actually licensing content later. They started in mid-2006, got VC funding in 2008, then made their first licensing deal in 2009.

https://www.forbes.com/2009/08/04/online-anime-video-technol...

https://venturebeat.com/business/crunchyroll-for-pirated-ani...

83. neonat+IC[view] [source] 2025-07-07 14:27:58
>>pyman+(OP)
https://archive.md/YLyPg
85. bgwalt+pE[view] [source] 2025-07-07 14:38:17
>>pyman+(OP)
Here is how individuals are treated for massive copyright infringement:

https://investors.autodesk.com/news-releases/news-release-de...

◧◩
93. piker+7G[view] [source] [discussion] 2025-07-07 14:47:13
>>bgwalt+pE
I thought you'd go with this: https://en.wikipedia.org/wiki/United_States_v._Swartz
◧◩
144. skybri+ER[view] [source] [discussion] 2025-07-07 15:55:01
>>trinsi+pL
Copyright is largely about distributing copies. It’s not about making something vaguely similar or about referencing copyrighted work to make something vaguely similar.

Although, there’s an exception for fictional characters:

https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...

◧◩◪◨
147. abeppu+aS[view] [source] [discussion] 2025-07-07 15:59:40
>>achier+fN
Maybe the most memorable version of the response is this: the "Copying is not Theft" song. https://www.youtube.com/watch?v=IeTybKL1pM4
◧◩
174. pavon+jX[view] [source] [discussion] 2025-07-07 16:29:59
>>trinsi+pL
There is another case where companies slurped up the whole internet and profited from the information, and it makes a good comparison: search engines.

Judges consider a four-factor test when examining fair use [1]. For search engines:

1) The use is transformative: a tool for finding content serves a very different purpose than the content itself.

2) The nature of the original works runs the full gamut, so search engines don't get points for consuming only factual data, but it was all publicly viewable by anyone, as opposed to books, which require payment.

3) Search engines store significant portions of the works in the index, but only redistribute small portions.

4) Search engines, as originally devised, don't compete with the originals; in fact, they can improve the potential market for the originals by helping more people find them. This has changed over time, though, and search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.

So traditional search, which was transformative, republished only small amounts of the originals, and didn't compete with them, fell firmly on the side of fair use.

Google News and Books on the other hand weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.

So now let's look at LLMs:

1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from the original works.

2) Again, the nature of the works runs the full gamut (setting aside the clearly infringing downloading of illegally distributed books, which is a separate issue).

3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, for which there is precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely than search engines to reproduce the originals in their outputs, there are failure cases where an LLM was over-trained on a work and a significant amount of the original can be reproduced.

4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used, e.g. ignore the LLM altogether and consider solely the output, and whether it would be infringing if a human had created it.

This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, training weighs strongly as being fair use.

I think the real problems LLMs face will be in factors 3 and 4, which are very much context specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/

◧◩◪◨⬒⬓
177. lillec+yX[view] [source] [discussion] 2025-07-07 16:31:20
>>buzzer+lR
Sweden has a political party called "The Pirate Party" (1), and "The Pirate Bay" is Swedish, so I think a couple of Swedes memeing before it was cool had a significant impact on making the name stick, while also taking the seriousness out of it.

1: https://piratpartiet.se/en/

◧◩◪◨
190. pavon+ZZ[view] [source] [discussion] 2025-07-07 16:47:33
>>palmot+aP
The judge did use some language that analogized the training to human learning. I don't read it as basing the legal judgement on anthropomorphizing the LLM, though, but rather as reasoning that if it would be legal for a human to do the same thing, then it is legal for that human to use a computer to do it.

  First, Authors argue that using works to train Claude’s underlying LLMs was like using
  works to train any person to read and write, so Authors should be able to exclude Anthropic
  from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for
  training or learning as such. Everyone reads texts, too, then writes new texts. They may need
  to pay for getting their hands on a text in the first instance. But to make anyone pay
  specifically for the use of a book each time they read it, each time they recall it from memory,
  each time they later draw upon it when writing new things in new ways would be unthinkable.
  For centuries, we have read and re-read books. We have admired, memorized, and internalized
  their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
  problems.

  ...

  In short, the purpose and character of using copyrighted works to train LLMs to generate
  new text was quintessentially transformative. Like any reader aspiring to be a writer,
  Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but
  to turn a hard corner and create something different. If this training process reasonably
  required making copies within the LLM or otherwise, those copies were engaged in a
  transformative use.
[1] https://authorsguild.org/app/uploads/2025/06/gov.uscourts.ca...
◧◩◪◨
193. superf+p01[view] [source] [discussion] 2025-07-07 16:50:12
>>throwa+QY
I'm trying to find the quote, but I'm pretty sure the judge specifically said that going and buying the book after the fact won't absolve them of liability. He said that for the books they pirated, they broke the law and should stand trial for that, and they cannot go back and un-break it by buying a copy now.

Found it: https://www.nbcnews.com/tech/tech-news/federal-judge-rules-c...

> “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft,” [Judge] Alsup wrote, “but it may affect the extent of statutory damages.”

◧◩◪◨
199. adolph+U11[view] [source] [discussion] 2025-07-07 16:57:13
>>organs+ZV
That a service incorporating the authors' works exists is not at issue. The plaintiffs' claims are, as summarized by Alsup:

  First, Authors argue that using works to train Claude’s underlying LLMs 
  was like using works to train any person to read and write, so Authors 
  should be able to exclude Anthropic from this use (Opp. 16). 

  Second, to that last point, Authors further argue that the training was 
  intended to memorize their works’ creative elements — not just their 
  works’ non-protectable ones (Opp. 17).

  Third, Authors next argue that computers nonetheless should not be 
  allowed to do what people do. 
https://media.npr.org/assets/artslife/arts/2025/order.pdf
◧◩◪◨⬒
220. lcnPyl+F71[view] [source] [discussion] 2025-07-07 17:28:46
>>charci+4Q
> RMS

Referring to this? (Wikipedia's disambiguation page doesn't seem to have a more likely article.)

https://en.wikipedia.org/wiki/Richard_Stallman#Copyright_red...

◧◩
248. alok-g+Na1[view] [source] [discussion] 2025-07-07 17:46:26
>>dehrma+DS
AFAIK, Judge Vince Chhabria has countered that Fair Use argument in a later order involving Meta.

https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...

Note: I am not a lawyer.

◧◩◪◨⬒⬓
296. tzs+Vh1[view] [source] [discussion] 2025-07-07 18:29:02
>>kube-s+fa1
He wasn't facing anywhere near that. When the DOJ charges someone with a set of charges, they like to say in the press release that the person is facing N years, where they get N by simply adding up the maximum for each charge, i.e. what a hypothetical defendant with every possible sentence-enhancing factor could get. They also ignore that some charges group for sentencing: your sentence for the group is the maximum sentence among the individual charges in the group.

Here's an article explaining in more detail [1].

Most experts say that if Swartz had gone to trial, the prosecution had proved everything they alleged, and the judge had decided to make an example of Swartz and sentence harshly, it would have been around 7 years.

Swartz's own attorney said that if they had gone to trial and lost, he thought it was unlikely that Swartz would get any jail time.

Swartz also had at least two plea bargain offers available. One was for a guilty plea and 4 months. The other was for a guilty plea and the prosecutors would ask for 6 months but Swartz could ask the judge for less or for probation instead and the judge would pick.

[1] https://www.popehat.com/2013/02/05/crime-whale-sushi-sentenc...
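
The press-release arithmetic described above can be sketched in a few lines. The charge names, maximums, and grouping below are hypothetical illustrations, not Swartz's actual charges:

```python
# Hypothetical charges: (name, statutory maximum in years, sentencing group)
charges = [
    ("wire fraud", 20, "fraud"),
    ("computer fraud", 5, "cfaa"),
    ("unauthorized access", 5, "cfaa"),
    ("damaging a protected computer", 10, "cfaa"),
]

# Press-release style: simply add every maximum together.
press_release_years = sum(max_years for _, max_years, _ in charges)

# Grouped sentencing: within a group, the sentence is the largest
# individual maximum, not the sum of them.
groups = {}
for _, max_years, group in charges:
    groups[group] = max(groups.get(group, 0), max_years)
grouped_years = sum(groups.values())

print(press_release_years)  # 40
print(grouped_years)        # 30 (20 for fraud + 10 for the cfaa group)
```

The gap between the two numbers only widens with more overlapping counts, which is how "35 years" headlines get manufactured.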

◧◩◪◨
308. Captai+sk1[view] [source] [discussion] 2025-07-07 18:44:42
>>codedo+J51
> Maybe instead of books we should start making applications that protect the content and do not allow copying text or making screenshots.

https://en.wikipedia.org/wiki/Analog_hole

◧◩◪◨⬒⬓⬔
341. __Matr+Lq1[view] [source] [discussion] 2025-07-07 19:31:26
>>hadloc+2e1
Fun fact: they didn't have the rights to use the font they used for those commercials: >>43775926
363. godels+Nt1[view] [source] 2025-07-07 19:53:39
>>pyman+(OP)
The solution has always been: show us the training data.

As a researcher I've been furious that we publish papers where the training data is unknown. To add insult to injury, we have the audacity to make claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish, especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM [0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.

On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it's big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt: "Steal everything, then if you get big, let the lawyers figure it out." [1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.

Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion-dollar company, 2020 when it became the first two-trillion-dollar company, and 2022 when it became the first three-trillion-dollar company. Now we have 10 companies north of the trillion-dollar mark! [3] (5 above $2T and 3 above $3T.) These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better, to at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker deals for training data. Hell, they're already going to war over data access.

My point here is that these two things align. We talk about how this technology is so dangerous (every single one of those CEOs has made that statement), and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself! [4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even controversial.

[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).

[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...

[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...

[3] https://companiesmarketcap.com/

[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!

◧◩
367. cmiles+Au1[view] [source] [discussion] 2025-07-07 19:59:06
>>marapu+ib
Google Music originally let people upload their own digital music files. The argument at the time was that whether or not the files were legally obtained was not Google’s problem. I believe Amazon had a similar service.

https://www.computerworld.com/article/1447323/google-reporte...

◧◩◪◨⬒⬓⬔
377. Anthon+yw1[view] [source] [discussion] 2025-07-07 20:11:10
>>badlib+Ir1
The first federal copyright law in 1790:

https://copyright.gov/about/1790-copyright-act.html

Specified in dollars because dollars had just been invented (in 1789), but in the amount of one half of one dollar, i.e. $0.50. That's 1790 dollars, of course, so a little under $20 today. (There was basically no inflation for the first 100+ years after that because the US dollar was still backed by precious metals; a dollar was worth slightly more in 1900 than in 1790.)

That seems more like an attempt to codify some amount of plausible actual damages, so people aren't arguing endlessly about valuations, rather than an attempt to impose punitive damages. Most notably because, unlike the current method, it scales with the number of sheets reproduced.
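
The conversion above can be sanity-checked with a rough long-run inflation factor. The ~35x multiplier for 1790 dollars is an assumption based on long-run CPI estimates, not a figure from the comment:

```python
# Rough conversion of the 1790 Act's half-dollar-per-sheet penalty to today's dollars.
penalty_1790 = 0.50       # dollars per sheet, under the 1790 Copyright Act
cpi_multiplier = 35       # assumed long-run inflation factor, 1790 -> today
penalty_today = penalty_1790 * cpi_multiplier
print(f"${penalty_today:.2f} per sheet")  # $17.50, i.e. "a little under $20"
```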

391. ChrisA+8A1[view] [source] 2025-07-07 20:35:54
>>pyman+(OP)
Two week old news.

Some previous discussions:

>>44367850

>>44381838

>>44381639

◧◩◪◨⬒
402. mormeg+XD1[view] [source] [discussion] 2025-07-07 21:04:25
>>wood_s+4s1
See first-sale doctrine <https://en.wikipedia.org/wiki/First-sale_doctrine>
◧◩◪◨
415. conrad+0G1[view] [source] [discussion] 2025-07-07 21:20:16
>>pier25+Fm1
Yes! Training and generation are fair use. You are free to train and generate whatever you want in your basement for whatever purpose you see fit. Build a music collection, go ham.

If the output from said model uses the voice of another person, for example, we already have a legal framework in place for determining if it is infringing on their rights, independent of AI.

Courts have heard cases of individual artists copying melodies, because melodies themselves are copyrightable: https://www.hypebot.com/hypebot/2020/02/every-possible-melod...

Copyright law is a lot more nuanced than anyone seems to have the attention span for.

◧◩◪◨⬒⬓
434. FateOf+WK1[view] [source] [discussion] 2025-07-07 21:59:28
>>zerocr+cs1
Minor exception: https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...
445. 1vuio0+qO1[view] [source] 2025-07-07 22:37:25
>>pyman+(OP)
Order on Fair Use

https://ia800101.us.archive.org/15/items/gov.uscourts.cand.4...

◧◩◪◨⬒
447. bigyab+YO1[view] [source] [discussion] 2025-07-07 22:43:21
>>burnt-+QM1
Google set the precedent for this with an even less transformative use case: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
◧◩◪◨⬒⬓⬔⧯▣▦
535. conrad+C13[view] [source] [discussion] 2025-07-08 13:50:07
>>Greed+6T2
The problem here is that there is no "test" known to work other than checking for direct infringement, which they have a responsibility to do (as they don't have a license to the originals).

Anything remotely beyond that and we have teams of humans adjudicating specific cases: https://library.mi.edu/musiccopyright/currentcases

541. revere+Kd3[view] [source] 2025-07-08 15:07:37
>>pyman+(OP)
Under the DMCA, the minimum penalty for an illegally downloaded file is $750 (https://copyrightresource.uw.edu/copyright-law/dmca/).

"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling

If they committed piracy 7 million times and the minimum fine for each instance is $750, then the law says Anthropic is liable for $5.25 billion. I just want it out there that they definitely broke the law and that the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens, we at least can't pretend we didn't know.
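
A quick sanity check of that arithmetic, taking the $750-per-work statutory minimum and the 7 million pirated works from the comment above at face value:

```python
# Statutory-minimum damages estimate, using the figures quoted above.
works_pirated = 7_000_000      # pirated books, per the ruling
minimum_per_work = 750         # statutory minimum per infringed work, USD
total = works_pirated * minimum_per_work
print(f"${total:,}")           # $5,250,000,000, i.e. $5.25 billion
```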

◧◩◪◨⬒⬓⬔⧯
592. akimbo+dO6[view] [source] [discussion] 2025-07-09 20:37:35
>>x18746+jc5
>Then don't use the service? Use one of the 'better' free alternatives

You can use alternatives, but those do not have the actual content that is the reason anybody watches YouTube (it's in the name).

I'm also talking about free alternatives being better than Premium; for example, offline videos still have DRM, unlike every free YT downloader ever. The only way they have made Premium better is by actively making the experience worse for everybody else, e.g. by paywalling the old default bitrate.

>Let's see how long they remain free once (if) they actually see a meaningful amount of traffic

The continued effort toward making the platform worse with every decision has nothing to do with funding:

https://www.thetoptens.com/youtube/youtube-features-were-rem...

◧◩◪◨⬒
597. frog41+IN9[view] [source] [discussion] 2025-07-10 21:49:39
>>cmiles+er1
https://bsky.app/profile/jtlg.bsky.social/post/3ltn6gtepsc2w