The argument that LLMs aren't copyright laundromats hinges on the scale and non-specificity of training. There's a difference between "LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet" and "LLM was intentionally trained to reproduce variants of this particular work". Whatever one's stance on the former case, the latter would be plainly infringing copyright, and admitting to it.
In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.
Ta-da.
You would think massive scale just means it has infringed even more copyrights, and should therefore be in even more hot water.
My point isn't to argue the merits of that case; it's just to point out that OP's joke is like a stereotypical LLM output: it seems to make sense, but really doesn't.
Basically, we need an open-source version of Glassdoor as an LLM?
Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.
OP wants to achieve the effect of a specific accusation using only non-specific means; that's not easy to pull off.
It’s the Wild West. The lack of a court case has no bearing on whether or not what they’re doing is right or wrong.
The four fair use factors are:
1) the purpose and character of the use,
2) the nature of the copyrighted material,
3) the *amount* and *substantiality* of the portion taken, and
4) the effect of the use upon the *potential market*.
So in that regard, if you're training a personal-assistant GPT and use some software code to teach your model logic, that is easy to defend as fair use.
But the extent of use matters: if you're training an AI for the sole purpose of regurgitating specific copyrighted material, that is infringement. In this case, though, it isn't a copyright issue at all; it's a matter of contracts and NDAs.
If you're only training on a handful of works, then you're taking more from each of them, meaning it's not de minimis.
For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes even without Congressional intervention!
[0] who is NOT pro-AI; he just thinks labor law is a better bulwark against it than copyright
Releasing an LLM trained on criticisms of the company, written by people specifically instructed not to publish them, is transparently violating the agreement.
Because you're intentionally publishing criticism of the company.
Does it really take millions of dollars of compute to add additional training data to an existing model?
Plus, we're talking about employees who are leaving or have already left anyway.
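On the compute question: no. Parameter-efficient fine-tuning adapts an existing model on a single consumer GPU. A minimal sketch, assuming the Hugging Face Transformers + PEFT stack (the model name and hyperparameters here are illustrative assumptions, not anyone's published recipe):

    # LoRA trains a tiny low-rank adapter instead of the full 7B weights,
    # which is why this fits on one GPU rather than a training cluster.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora = LoraConfig(
        r=8,                                  # low-rank adapter dimension
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of weights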
>Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.
Excellent. That means plausible deniability.
Surely all those horror stories about unethical behavior are just hallucinations, no matter how specific they are.
Absolutely no reason for anyone to take them seriously. Which is why the press will not hesitate to run with that, with appropriate disclaimers, of course.
Seriously, you seem to think that in a world where death-toll numbers in Gaza are taken verbatim from Hamas without being corroborated by other sources, an AI model's output will not pass the test of public scrutiny?
Very optimistic of you.
The word probabilities are transformative use, a form of fair use, and aren't an issue.
The specific output at each point in time is what would be judged to be fair use or copyright infringing.
I'd argue the user is responsible for ensuring they don't use the output in a copyright-infringing manner (e.g., distributing it for profit), since they fed the inputs into the model that led to that output. In the same way, you can't sue Microsoft because someone typed up copyrighted works in Microsoft Word and then distributed them for profit.
De minimis is still helpful here; not all infringements are noteworthy.
It's up to you if that counts as "a handful" or not.
Based on what? This isn't a legal argument that will hold water in any court I'm aware of.
Take math or computer science, for example: some very important algorithms can be compressed to a few bits of information if you (or a model) have a thorough understanding of the surrounding theory to go with it. Would it not amount to IP infringement if a model regurgitates the relevant information from a patent application, even if it is represented by under a kilobyte of information?
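As a toy illustration of how little source an important algorithm can occupy (my example, not the parent's, and a few dozen bytes rather than literally "a few bits"): Marsaglia's xorshift32 generator, a widely used PRNG, is essentially three shift-xor steps.

    def xorshift32(x: int) -> int:
        # Marsaglia's xorshift32 PRNG: the entire core of a well-known
        # algorithm in three lines of arithmetic.
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        return x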
But most people don't want to live in permanent mental distress due to shame over past actions or fear of rebellion, I guess.
Making people believe that anything beyond their own body and mind can count as their property is stealing their lucidity.
I would think that if I can recognize exactly what song it comes from, it's not de minimis.
> LLMs not being copyright laundromats
This is a brilliant phrase. You might as well put it into an Emacs paste macro now; it won't be the last time you need it. And the OP is classic HN folly, where a programmer thinks laws and courts can be hacked with "this one weird trick".

I think this is all still compatible with saying that ingesting an entire book is still:
> If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low
(Though I wouldn't want to bet either way on the "so courts aren't likely to care" that follows from that quote: my not-legally-trained interpretation of the rules leaves me confused about how traditional search engines aren't a copyright violation.)
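To make the "word probabilities" framing concrete, here's a toy n-gram version (real LLM training is not literal bigram counting, so treat this purely as an illustration; the stand-in corpus is hypothetical):

    from collections import Counter

    def bigram_counts(text: str) -> Counter:
        # Count adjacent word pairs; pooled across a corpus, these are
        # the "word probabilities" in the argument above.
        words = text.lower().split()
        return Counter(zip(words, words[1:]))

    library = ["text of book one ...", "text of book two ..."]  # stand-in corpus
    corpus = Counter()
    for book in library:
        corpus += bigram_counts(book)
    # Any single book's contribution is diluted in the aggregate counts.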
> Collateralised Copyright Liability
Is this a real legal/finance term, or did you make it up? Also, I don't follow your leap comparing LLMs to CDOs (collateralised debt obligations). And do you specifically mean CDOs, or any kind of mortgage/commercial-loan structured finance deal?
Because that's the distinction being argued here: it's "a handful"[0] of probabilities, not the complete work.
[0] I'm not sold on the phrasing "a handful", but I don't care enough to argue terminology; the term "handful" feels like it's being used in a sorites paradox kind of way: https://en.wikipedia.org/wiki/Sorites_paradox
The process involved in obtaining that end work is completely irrelevant to any copyright case. A claim can be made against the model's weights (not possible, as that's fair use) or against the specific one-off output work (less clear), but it can't be looked at as a whole.
More generally, we tend to view the number of casualties in a war as one large number, not as the sum of all the individual tragedies it represents, the kind we perceive when fewer people die.
If you were a director at a game company and needed art in that style, it would be cheaper to have the AI do it instead of buying from the artist.
I think this is currently an open question.
As my not-legally-trained interpretation of the rules leaves me confused about how traditional search engines aren't a copyright violation, I don't trust my own beliefs about the law.
https://www.nytimes.com/2023/12/27/business/media/new-york-t...
https://www.reuters.com/legal/us-newspapers-sue-openai-copyr...
https://www.washingtonpost.com/technology/2024/04/09/openai-...
Some decided to make deals instead.
Uber is no longer subsidized (or even cheap) in most places, it's just an app for summoning taxis and overpriced snacks. AirBnB is underregulated housing for nomads at this point.
Your examples sort of prove the point: they didn't succeed at what they set out to do, so they pivoted until they landed on something the law permits.
https://www.federalregister.gov/documents/2023/03/16/2023-05...
So I think the law, at least as currently interpreted, does care about the process.
Though maybe you meant whether a new work infringes an existing copyright? This guidance is clearly about new copyrights.
What is it about OpenAI that lets them get away with it, but not our hypothetical Mr. Smartass using the same process to get around an NDA?
Who created the work? The user who instructed the AI; it's a tool, so you can't attribute the work to the AI. That would be the equivalent of crediting Photoshop as co-author of your work.
The user is "inputting variables into their probability algorithm that's resulting in the copyright work".
It's not merely a compressed version of a song intended to be used in the same way as the original copyrighted work; that would be copyright infringement.
An AI-enhanced Photoshop could do wonders, though, as the base capabilities seem to be mostly there. I haven't used any of the newer AI stuff myself, but https://www.shruggingface.com/blog/how-i-used-stable-diffusi... makes it pretty clear the building blocks largely exist. So my guess is the main disconnect is in making the machines understand natural-language instructions for how to change the art.
They tend to argue for conspiracy to commit copyright infringement, which is a tenuous case to make unless they can prove that was actually the intention. I think in most cases it's ISP/hosting terms and conditions and legal costs that lead to their demise.
Your example of the model specifically asking "what copyrighted content would you like to download" kinda implies conspiracy to commit copyright infringement would be a valid charge.
A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit is obviously fine.
So, for example, what the NYT is suing over is that it (or so it is claimed) allows the model to regenerate entire articles, which is not OK.
But to claim that it is copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:
> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.
… which is just under 1200 bits, is an unlawful thing to post (and for the purpose of the hypothetical, imagine that quotation in the form of a zero-context tweet rather than the actual fact of this being a case of fair-use because of its appearance in a discussion about copyright infringement of novels).
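A quick sanity check on the "just under 1200 bits" arithmetic (my snippet, not the original commenter's):

    summary = ("Harry Potter discovers he is a wizard and attends Hogwarts, "
               "where he battles dark forces, including the evil Voldemort, "
               "to save the wizarding world.")
    print(len(summary.encode("utf-8")) * 8)  # 1184 bits, just under 1200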
I think anyone who suggests suing over this to a lawyer would discover that lawyers can, in fact, laugh.
Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities; some people accuse OpenAI et al. of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better fit for that description than the specific things they're formally accused of in the lawsuits.
> nobody could see what was inside CDOs
Absolutely not true. Where did you get that idea? When pricing the bonds from a CDO, you get to see the initial collateral, and as a bondholder you receive monthly updates about any portfolio changes. Weirdly, CDOs frequently have more collateral transparency than commercial or residential mortgage deals.