https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
https://gizmodo.com/early-spotify-was-built-on-pirated-mp3-f...
https://www.forbes.com/2009/08/04/online-anime-video-technol...
https://venturebeat.com/business/crunchyroll-for-pirated-ani...
https://investors.autodesk.com/news-releases/news-release-de...
There is an exception for fictional characters, though:
https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...
Judges consider a four-factor test when examining fair use[1]. For search engines:
1) The use is transformative, as a tool for finding content serves a very different purpose than the content itself.
2) The nature of the original works runs the full gamut, so search engines don't get points for only consuming factual data, but the content was all publicly viewable by anyone, as opposed to books, which require payment.
3) The search engine stores significant portions of the works in its index, but it only redistributes small portions.
4) Search engines, as originally devised, don't compete with the originals; in fact, they can improve the potential market for the originals by helping more people find them. This has changed over time, though: search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.
So traditional search - which was transformative, only republished small amounts of the originals, and didn't compete with them - fell firmly on the side of fair use.
Google News and Google Books, on the other hand, weren't so clear-cut, as they showed larger portions of the works and competed with the originals. They had to make changes to those products as a result of lawsuits.
So now let's look at LLMs:
1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from the original works.
2) Again, the nature of the works runs the full gamut (setting aside the clearly infringing downloading of illegally distributed books, which is a separate issue).
3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index - and there is precedent that even storing entire works privately can be fair use, depending on the other factors. For inference, even though LLMs are less likely than search engines to reproduce the originals in their outputs, there are failure cases where a model over-trained on a work and a significant amount of the original can be reproduced (a rough way to check for this is sketched after this list).
4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, whether an LLM is fair use will likely depend on how it is being used - e.g., ignore the LLM altogether, consider solely the output, and ask whether it would be infringing if a human had created it.
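To make the factor-3 failure mode concrete, here's a minimal sketch of how one might flag an output that reproduces a long verbatim span of a source work. The function names and the 50-character threshold are my own illustrative choices, not anything from the ruling:

    # Hypothetical check for verbatim reproduction of a source work in an
    # LLM output; the 50-character threshold is an arbitrary assumption.
    from difflib import SequenceMatcher

    def longest_verbatim_run(output: str, original: str) -> str:
        """Longest contiguous substring shared by output and original."""
        m = SequenceMatcher(None, output, original).find_longest_match(
            0, len(output), 0, len(original))
        return output[m.a:m.a + m.size]

    def looks_memorized(output: str, original: str, threshold: int = 50) -> bool:
        """Flag outputs that reproduce a long verbatim span of the original."""
        return len(longest_verbatim_run(output, original)) >= threshold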
This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, training weighs strongly toward being fair use.
I think the real problems that LLMs face will be in factors 3 and 4, which are very much context-specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.
[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
> ...
> In short, the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.
[1] https://authorsguild.org/app/uploads/2025/06/gov.uscourts.ca...
Found it: https://www.nbcnews.com/tech/tech-news/federal-judge-rules-c...
> “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft,” [Judge] Alsup wrote, “but it may affect the extent of statutory damages.”
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16).
> Second, to that last point, Authors further argue that the training was intended to memorize their works’ creative elements — not just their works’ non-protectable ones (Opp. 17).
> Third, Authors next argue that computers nonetheless should not be allowed to do what people do.
https://media.npr.org/assets/artslife/arts/2025/order.pdf
Referring to this? (Wikipedia's disambiguation page doesn't seem to have a more likely article.)
https://en.wikipedia.org/wiki/Richard_Stallman#Copyright_red...
https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...
Note: I am not a lawyer.
Here's an article explaining in more detail [1].
Most experts say that if Swartz had gone to trial, the prosecution had proved everything they alleged, and the judge had decided to make an example of Swartz and sentence him harshly, it would have been around 7 years.
Swartz's own attorney said that if they had gone to trial and lost, he thought it was unlikely that Swartz would get any jail time.
Swartz also had at least two plea-bargain offers available. One was for a guilty plea and 4 months. The other was for a guilty plea where the prosecutors would ask for 6 months, but Swartz could ask the judge for less or for probation instead, and the judge would pick.
[1] https://www.popehat.com/2013/02/05/crime-whale-sushi-sentenc...
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to make claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is currently intractable; we do not have the mathematical tools to do so. But the problem is much harder to crack when we're just conjecturing about what the data even is, and ultimately this makes interpretability work more difficult.
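To illustrate how simple a contamination check is when you can actually see the training data, here's a toy sketch; the 13-gram window and whitespace tokenization are arbitrary choices of mine, and the point is that without knowing the data, even this much is impossible:

    # Toy n-gram contamination check between an eval set and a training
    # corpus; the 13-gram window and tokenization are illustrative choices.
    def ngrams(text: str, n: int = 13) -> set[str]:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(eval_items, training_docs, n=13):
        """Return eval items sharing any n-gram with the training corpus."""
        train_grams = set()
        for doc in training_docs:
            train_grams |= ngrams(doc, n)
        return [item for item in eval_items if ngrams(item, n) & train_grams]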
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt: "Steal everything, then if you get big, let the lawyers figure it out."[1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion-dollar company, 2020 when it became the second two-trillion-dollar company, and 2022 when it became the first three-trillion-dollar company. Now we have 10 companies north of the trillion-dollar mark![3] (5 above $2T and 3 above $3T) These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better, or to at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement), and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
https://www.computerworld.com/article/1447323/google-reporte...
https://copyright.gov/about/1790-copyright-act.html
Specified in dollars because dollars had just been invented (in 1789), but in the amount of one half of one dollar, i.e. $0.50. That's 1790 dollars, of course, so a little under $20 today. (There was basically no inflation for the first 100+ years of that because the US dollar was still backed by precious metals then; a dollar was worth slightly more in 1900 than in 1790.)
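As a back-of-the-envelope check (the ~35x multiplier is my own rough long-run CPI assumption; estimates vary):

    # Rough conversion of the 1790 Act's 50-cent-per-sheet penalty to
    # today's dollars; the 35x multiplier is an assumed CPI estimate.
    penalty_1790 = 0.50
    cpi_multiplier = 35
    print(f"~${penalty_1790 * cpi_multiplier:.2f} per sheet today")  # ~$17.50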
That seems more like an attempt to codify some amount of plausible actual damages so people aren't arguing endlessly about valuations, rather than an attempt to impose punitive damages. Most notably because -- unlike the current method -- it scales with the number of sheets reproduced.
If the output from said model uses the voice of another person, for example, we already have a legal framework in place for determining if it is infringing on their rights, independent of AI.
Courts have heard cases of individual artists copying melodies, because melodies themselves are copyrightable: https://www.hypebot.com/hypebot/2020/02/every-possible-melod...
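That link is about the project that brute-forced every melody in a constrained space and published the lot; here's a sketch of the idea (the 8-pitch, 12-note parameters are my reading of the project, so treat them as assumptions):

    from itertools import product

    # Brute-force enumeration of every melody in a constrained space,
    # in the spirit of the linked project; parameters are assumptions.
    PITCHES = list(range(8))  # 8 scale degrees
    LENGTH = 12               # notes per melody

    print(len(PITCHES) ** LENGTH)               # 68,719,476,736 melodies
    melodies = product(PITCHES, repeat=LENGTH)  # lazy enumeration
    print(next(melodies))                       # first melody: all zeros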
Copyright law is a lot more nuanced than anyone seems to have the attention span for.
Anything remotely beyond that and we have teams of humans adjudicating specific cases: https://library.mi.edu/musiccopyright/currentcases
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum statutory fine for each instance is $750, then the law says that Anthropic is liable for at least $5.25 billion. I just want it out there that they definitely broke the law, and that the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens, we at least can't pretend we didn't know.
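For the arithmetic: the $750-$30,000 per-work statutory range (up to $150,000 if willful) comes from 17 U.S.C. § 504(c), and the 7 million count is the figure from this thread:

    # Statutory-damages arithmetic for ~7 million infringed works.
    works = 7_000_000
    print(f"minimum: ${works * 750:,}")      # $5,250,000,000
    print(f"maximum: ${works * 30_000:,}")   # $210,000,000,000
    print(f"willful: ${works * 150_000:,}")  # $1,050,000,000,000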
You can use alternatives, but those do not have the actual content that is the reason anybody watches YouTube (it's in the name).
I'm also talking about free alternatives to Premium being better: for example, offline videos still have DRM, unlike every free YouTube downloader ever. The only way they have made Premium better is by actively making the experience worse for everybody else, e.g. by paywalling the old default bitrate.
> Let's see how long they remain free once (if) they actually see a meaningful amount of traffic
The continued effort toward making the platform worse with every decision does not have anything to do with funding:
https://www.thetoptens.com/youtube/youtube-features-were-rem...