zlacker

The best approach to circumventing the nondisclosure agreement is for the affected employees to get together, write out everything they want to say about OpenAI, train an LLM on that text, and then release it.

Based on these companies' arguments that copyrighted material is not actually reproduced by these models, and that any seemingly-infringing use is the responsibility of the user of the model rather than those who produced it, anyone could freely generate an infinite number of high-truthiness OpenAI anecdotes, freshly laundered by the inference engine, that couldn't be used against the original authors without OpenAI invalidating their own legal stance with respect to their own models.

replies(15): >>rlt+P >>bboygr+H1 >>judge2+J1 >>TeMPOr+c2 >>otterl+83 >>visarg+S3 >>Always+05 >>renewi+I5 >>jahews+J6 >>KoolKa+s8 >>NoMore+cc >>andyjo+Cd >>p0w3n3+9f >>b112+Zf >>cqqxo4+Cg

>>mwigda+(OP)
This would be hilarious and genius. Touché.

>>mwigda+(OP)
Genius. I'm praying for this to happen.

>>mwigda+(OP)
NDAs don’t touch the copyright of your speech / written works you produce after leaving, they just make it breach of contract to distribute those words.

replies(3): >>otabde+Q1 >>elicks+R1 >>romwel+y2

>>judge2+J1
Technically, no words are being distributed here. (At least according to OpenAI lawyers.)

>>judge2+J1
Following the legal defense of these companies, the employees wouldn’t be distributing any words. They’re distributing a model.

replies(2): >>JumpCr+h4 >>cqqxo4+Mg

>>mwigda+(OP)
Clever, but no.

The argument about LLMs not being copyright laundromats making sense hinges on the scale and non-specificity of training. There's a difference between "LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet", vs. "LLM was intentionally trained to specifically reproduce variants of this particular work". Whatever one's stance on the former case, the latter case would be plainly infringing copyright and admitting to it.

In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.

replies(10): >>romwel+s2 >>bluefi+w2 >>adra+h3 >>makeit+B3 >>8note+35 >>tadfis+w5 >>dorkwo+R6 >>anigbr+b7 >>aprilt+i9 >>throwa+8e

>>TeMPOr+c2
Cool, just feed the ChatGPT+ the same half the Internet plus OpenAI founders' anecdotes about the company.

Ta-da.

replies(1): >>TeMPOr+H3

>>TeMPOr+c2
Seems absurd that the scale being massive somehow makes it better.

You would think having a massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water

replies(6): >>TeMPOr+L2 >>NewJaz+N2 >>omeid2+i5 >>kmeist+47 >>tempod+Ha >>blksv+4f

>>judge2+J1
>they just make it breach of contract to distribute those words.

See, they aren't distributing the words, and good luck proving that any specific words went into training the model.

>>bluefi+w2
You may or may not agree with it, but that's the only thing that makes it different - scale and non-specificity. Same thing that worked for search engines, for example.

My point isn't to argue merits of that case, it's just to point out that OP's joke is like a stereotypical output of an LLM: seems to make sense, but really doesn't.

>>bluefi+w2
My US history teacher taught me something important. He said that if you are going to steal and don't want to get in trouble, steal a whole lot.

replies(3): >>Pontif+l8 >>psycho+9b >>throwa+gf

>>mwigda+(OP)
IAAL (but not your lawyer and this is not legal advice).

That’s not how it works. It doesn’t matter if you write the words yourself or have an agent write them for you. In either case, it’s the communication of the covered information that is proscribed by these kinds of agreements.

>>TeMPOr+c2
Which has been established in court where?

replies(2): >>sundal+I3 >>TeMPOr+54

>>TeMPOr+c2
My takeaway is that we should talk about our experiences at companies at a large enough scale that it becomes non-specific in principle, and not targeted at a single company.

Basically, we need our open source version of Glassdoor as an LLM?

replies(1): >>TeMPOr+Z3

>>romwel+s2
And be rightfully sacked for maliciously burning millions of dollars on a retrain to purposefully poison the model?

Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.

replies(1): >>romwel+H7

>>adra+h3
+1, this is just the commenter saying what they want without an actual court case

replies(1): >>cj+24

>>mwigda+(OP)
No need for an LLM; an anonymous letter does the same thing.

replies(1): >>throwa+JD

>>makeit+B3
This exists, it's called /r/antiwork :).

OP wants to achieve effects of specific accusation using only non-specific means; that's not easy to pull off.

>>sundal+I3
The justice system moves an order of magnitude slower than technology.

It’s the Wild West. The lack of a court case has no bearing on whether or not what they’re doing is right or wrong.

replies(1): >>6510+6a

>>adra+h3
And it matters how? I didn't say the argument is correct or approved by court, or that I even support it. I'm saying what the argument, which OP referenced, is about, and how it differs from their proposal.

>>elicks+R1
They’re disseminating the information. Form isn’t as important as it is for copyright.

>>mwigda+(OP)
if I slaved away at openai for a year to get some equity, I don't think I would want to be the one to try this strategy

>>TeMPOr+c2
The scale of two people should be large enough to make it ambiguous who spilled the beans at least

>>bluefi+w2
It may not make a lot of sense but it follows the "fair use" doctrine. Which is generally based on the following 4 factors:

1) the purpose and character of use.

2) the nature of the copyrighted material.

3) the *amount* and *substantiality* of the portion taken; and

4) the effect of the use upon the *potential market*.

So in that regard, if you're training a personal assistant GPT, and use some software code to teach your model logic, that is easy to defend as fair use.

But the extent of use matters, and if you're training an AI for the sole purpose of regurgitating specific copyrighted material, it is infringement. In this case, though, it is not a copyright issue at all; it is a matter of contracts and NDAs.

>>TeMPOr+c2
To definitively prove this either way, they'll have to make their source code and model available (maybe under subpoena and/or gag order), so don't expect this issue to be actually tested in court (so long as the defendants have enough VC money).

>>mwigda+(OP)
To be honest, you can just say “I don’t have anything to add on that subject” and people will get the impression. No one ever says that about companies they like so you know when people shut down that something was up.

“What was the company culture like?” “Etc. platitude so on and so forth”

“And I heard the CEO was a total dickbag. Was that your experience working with him?” “I don’t have anything to add on that subject”

Of course, going back and forth like that with one person won’t really work. But across different people, you can’t be expected not to say the nice things, and someone could then build up a story based on where you went quiet.

>>mwigda+(OP)
Ha ha, but no. For starters, copyright falls under federal law and contracts under state law, so it’s not even possible to make this claim in the relevant court.

>>TeMPOr+c2
How many sources do you need to steal from for it to no longer be considered stealing? Two? Three? A hundred?

replies(1): >>TeMPOr+r9

>>bluefi+w2
So, the law has this concept of 'de minimis' infringement, where if you take a very small amount - like, way smaller than even a fair use - the courts don't care. If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low, so courts aren't likely to care.

If you're only training on a handful of works then you're taking more from them, meaning it's not de minimis.

For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes without Congressional intervention even!

[0] who is NOT pro-AI, he just thinks labor law is a better bulwark against it than copyright

replies(4): >>wtalli+J7 >>KoolKa+58 >>bryanr+Zd >>Gravit+jg

>>TeMPOr+c2
It's not a copyright violation if you voluntarily provide the training material...

replies(1): >>XorNot+D7

>>anigbr+b7
I don't know why copyright is getting involved here. The clause is about criticizing the company.

Releasing an LLM trained on company criticisms, by people specifically instructed not to do so is transparently violating the agreement.

Because you're intentionally publishing criticism of the company.

>>TeMPOr+H3
>And be rightfully sacked for maliciously burning millions of dollars on a retrain to purposefully poison the model?

Does it really take millions of dollars of compute to add additional training data to an existing model?

Plus, we're talking about employees that are leaving / left anyway.

>Not to mention: LLMs aren't oracles. Whatever they say will be dismissed as hallucinations if it isn't corroborated by other sources.

Excellent. That means plausible deniability.

Surely all those horror stories about unethical behavior are just hallucinations, no matter how specific they are.

Absolutely no reason for anyone to take them seriously. Which is why the press will not hesitate to run with that, with appropriate disclaimers, of course.

Seriously, you seem to think that in a world where numbers about death toll in Gaza are taken verbatim from Hamas without being corroborated by other sources, an AI model output will not pass the test of public scrutiny?

Very optimistic of you.

>>kmeist+47
If your training process ingests the entire text of the book, and trains with a large context size, you're getting more than just "a handful of word probabilities" from that book.

replies(1): >>ben_w+L8

>>kmeist+47
You don't even need to go this far.

The word-probabilities are transformative use, a form of fair use and aren't an issue.

The specific output at each point in time is what would be judged to be fair use or copyright infringing.

I'd argue the user would be responsible for ensuring they're not infringing by using the output in a copyright infringing manner i.e. for profit, as they've fed certain inputs into the model which led to the output. In the same way you can't sue Microsoft for someone typing up copyrighted works into Microsoft Word and then distributing for profit.

De minimis is still helpful here; not all infringements are noteworthy.

replies(3): >>rcbdev+sc >>surfin+wd >>kibibu+Bi

>>NewJaz+N2
Copying one person is plagiarism. Copying lots of people is research.

replies(1): >>comfys+Cb

>>mwigda+(OP)
Lol this would be a great performative piece. Although not so sure it'd stand up to scrutiny. OpenAI could probably take them to court on the grounds of disclosure of trade secrets or something like that and force them to reveal the model's training data, thus potentially revealing its source.

replies(1): >>nextac+t9

>>wtalli+J7
If you've trained a 16-bit ten billion parameter model on ten trillion tokens, then the mean training token changes 2/125 of a bit, and a 60k word novel (~75k tokens) contributes 1200 bits.

It's up to you if that counts as "a handful" or not.
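
For anyone who wants to check the arithmetic, here is a minimal Python sketch. The figures (16-bit weights, ten billion parameters, ten trillion training tokens, ~75k tokens per novel) are the ones quoted above; the variable names are only illustrative, and it assumes the model's capacity is spread evenly over every training token, which is a simplification rather than a claim about how training actually allocates information.

```python
from fractions import Fraction

# Figures taken from the comment above; names are illustrative only.
BITS_PER_PARAM = 16
PARAMS = 10 * 10**9             # ten billion parameters
TRAINING_TOKENS = 10 * 10**12   # ten trillion tokens
NOVEL_TOKENS = 75_000           # ~60k-word novel

# Total capacity of the weights, in bits.
total_model_bits = BITS_PER_PARAM * PARAMS

# Average capacity available per training token (exact fraction).
bits_per_token = Fraction(total_model_bits, TRAINING_TOKENS)

# Average share attributable to one novel under the even-spread assumption.
novel_bits = bits_per_token * NOVEL_TOKENS

print(bits_per_token)  # 2/125
print(novel_bits)      # 1200
```

That is an average, so it says nothing about how much of any particular book a model can actually reproduce.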

replies(4): >>hanswo+W9 >>snovv_+sa >>andrep+Od >>throwa+1f

>>TeMPOr+c2
> In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.

Based on what? This isn't any legal argument that will hold water in any court I'm aware of

>>dorkwo+R6
Copyright infringement is not stealing.

replies(1): >>psycho+8c

>>KoolKa+s8
If they did so, they would open themselves up to lawsuits from people unhappy about OpenAI's own training data.

So they probably won't.

replies(1): >>KoolKa+df

>>ben_w+L8
I think it’s questionable whether you can actually use this bit count to represent the amount of information from the book. Those 1200 bits represent the way in which this particular book is different from everything else the model has ingested. Similarly, if you read an entire book yourself, your brain will just store the salient bits, not the entire text, unless you have a photographic memory.

If we take math or computer science for example: some very important algorithms can be compressed to a few bits of information if you (or a model) have a thorough understanding of the surrounding theory to go with it. Would it not amount to IP infringement if a model regurgitates the relevant information from a patent application, even if it is represented by under a kilobyte of information?

replies(1): >>ben_w+ee

>>cj+24
Sounds like the standard disrupt formula should apply. Can't we stuff the court into an app? I kinda dislike the idea of getting a different sentence for anything related to appearance or presentation.

>>ben_w+L8
If I invent an amazing lossless compression algorithm such that adding an entire 60k word novel to my blob only increases the size by 1.2kb, does that mean I'm not copyright infringing if I release that model?

replies(1): >>Sharli+0e

>>bluefi+w2
Almost reminds one of real life: The big thieves get away and have a fan base while the small ones get prosecuted as criminals.

>>NewJaz+N2
Scale might be a factor, but it's not the only one. Your neighbor might not care if you steal a grass stalk from their lawn, and might feel powerless if you're the bloody dictator of a country who wastes tremendous amounts of resources on socially useless whims funded by overwhelming taxes.

But most people don't want to live in permanent mental distress due to shame of past action or fear of rebellion, I guess.

>>Pontif+l8
True, but if you research lots of sources and still emit significant blocks of verbatim text without attribution, it’s still plagiarism. At least that’s how human authors are judged.

replies(1): >>TeMPOr+3e

>>TeMPOr+r9
True.

Making people believe that anything but their own body and mind can be considered part of their own properties is stealing their lucidity.

>>mwigda+(OP)
NDAs don't rely on copyright to protect the party who drafted them from disclosure. There might even be an argument to be made that training the LLM on it was disclosure, regardless of whether you release the LLM publicly or not. We all work in tech, right? Why do even you people get intellectual property so wrong, every single time?

>>KoolKa+58
OpenAI is outputting the partially copyright-infringing works of their LLM for profit. How does that square?

replies(2): >>throwa+Ge >>KoolKa+bg

>>KoolKa+58
MS Word does not actively collect and process all texts from all available sources and does not offer them in recombined form. MS Word is passive, whereas the whole point of an LLM is to produce output using a model trained on ingested data. It is actively processing vast amounts of text with intent to make them available for others to use, and the T&C state that the user owns the copyright to the outputs based on works of other copyright owners. LLMs give the user a CCL (Collateralised Copyright Liability, a bit like a CDO) without a way of tracing the sources used to train the model.

replies(2): >>throwa+re >>KoolKa+Jf

>>mwigda+(OP)
Clever, but the law is not a machine or an algorithm. Intent matters.

Training an LLM with the intent of contravening an NDA is just plain <intent to contravene an NDA>. Everyone would still get sued anyway.

replies(2): >>jeffre+lf >>bazoom+tg

>>ben_w+L8
xz can compress the text of Harry Potter by a factor of 30:1. Does that mean I can also distribute compressed copies of copyrighted works and that's okay?
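
To make the comparison concrete, here is a small Python sketch using the standard-library `lzma` module (the same compression family xz uses). The sample text is a placeholder, not the book, and the ratio it prints is not the 30:1 figure quoted above; a highly repetitive sample compresses far better than real prose. The point it illustrates is only that this kind of compression is lossless: every byte comes back.

```python
import lzma


def compression_ratio(text: str) -> float:
    """Return original_size / compressed_size for LZMA-compressed text."""
    raw = text.encode("utf-8")
    packed = lzma.compress(raw)
    # Lossless: decompressing must give back exactly the original bytes.
    assert lzma.decompress(packed) == raw
    return len(raw) / len(packed)


# Placeholder sample; deliberately repetitive, so it compresses very well.
sample = "It was the best of times, it was the worst of times. " * 2000
ratio = compression_ratio(sample)
print(f"{ratio:.1f}:1")
```

A per-token average of model capacity carries no such round-trip guarantee, which is where the two cases differ.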

replies(3): >>ben_w+se >>Sharli+Ae >>realus+5f

>>kmeist+47
>we also thought sampling in music was de minimis

I would think if I can recognize exactly what song it comes from - not de minimis.

replies(1): >>throwa+Qe

>>snovv_+sa
How is that relevant? If some LLM were able to regurgitate a 60k word novel verbatim on demand, sure, the copyright situation would be different. But last I checked they can’t, not 60k, 6k, or even 600 words. Perhaps they can do 60 words of some well-known passages from the Bible or other similar ubiquitous copyright-free works.

replies(1): >>snovv_+0c2

>>comfys+Cb
Plagiarism is not illegal, it is merely frowned on, and only in certain fields at that.

replies(1): >>bayind+1n

>>TeMPOr+c2

    > LLMs not being copyright laundromats

This is a brilliant phrase. You might as well put that into an Emacs paste macro now. It won't be the last time you will need it. And the OP is classic HN folly where a programmer thinks laws and courts can be hacked with "this one weird trick".

replies(1): >>calvin+Ie

>>hanswo+W9
I agree with what I think you're saying, so I'm not sure I've understood you.

I think this is all still compatible with saying that ingesting an entire book is still:

> If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low

(Though I wouldn't want to make a bet either way on "so courts aren't likely to care" that follows on from that quote: my not-legally-trained interpretation of the rules leads to me being confused about how traditional search engines aren't a copyright violation).

>>surfin+wd
First, I agree with nearly everything that you wrote. Very thoughtful post! However, I have some issues with the last sentence.

    > Collateralised Copyright Liability

Is this a real legal / finance term or did you make it up?

Also, I do not follow your leap to compare LLMs to CDOs (collateralised debt obligations). And, do you specifically mean CDO or any kind of mortgage / commercial loan structured finance deal?

replies(1): >>surfin+rf

>>andrep+Od
Can you get that book out of an LLM?

Because that's the distinction being argued here: it's "a handful"[0] of probabilities, not the complete work.

[0] I'm not sold on the phrasing "a handful", but I don't care enough to argue terminology; the term "handful" feels like it's being used in a sorites paradox kind of way: https://en.wikipedia.org/wiki/Sorites_paradox

>>andrep+Od
Incredibly poor analogy. If an LLM were able to regurgitate Harry Potter on demand like xz can, the copyright situation would be much more black and white. But they can’t, and it’s not even close.

>>rcbdev+sc
You raise an interesting point. If more professional lawyers agreed with you, then why have we not seen a lawsuit from publishers against OpenAI?

replies(2): >>dgolds+0h >>kmeist+Xm3

>>throwa+8e
But they can, just look at AirBnB, Uber, etc.

replies(2): >>throwa+pf >>abofh+9h

>>bryanr+Zd
When I was younger, I was told that the Beastie Boys album Paul's Boutique was the straw that broke the camel's back! I have no idea if this is true, but that album has a batshit crazy amount of recognizable samples. I doubt very much that the Beastie Boys paid anything for the rights to sample.

>>ben_w+L8
To be fair, OP raises an important question that I hope smart legal minds are pondering. In my view, they aren't looking for a "programmer answers about legal issue" response. Probably the right court might agree with their premise. What the damages or restrictions might be, I cannot speculate. Any IP lawyers here who want to share some thoughts?

replies(1): >>ben_w+Og

>>bluefi+w2
It is the same scale argument that allows you to publish a photo of a procession without written consent from every participant.

>>andrep+Od
You can't get Harry Potter out of the LLM, that's the difference

>>mwigda+(OP)
that's the evilest thing I can imagine - fighting with them with their own weapon

>>nextac+t9
Good point

>>NewJaz+N2
Very interesting post! Can you share more about your teacher's reasoning?

replies(1): >>SuchAn+0g

>>andyjo+Cd
But then training a commercial model is done with the intent to not pay the original authors, how is that different?

replies(3): >>kdnvk+Yf >>repeek+ph >>mpweih+6i

>>calvin+Ie
No, lots of jurisdictions outside the US fought back against those shady practices.

>>throwa+re
My analogy is based on the fact that nobody could see what was inside CDOs nor did they want to see, all they wanted to do was pass them on to the next sucker. It was all fun until it all blew up. LLM operators behave in the same way with copyrighted material. For context, read https://nymag.com/news/business/55687/

replies(1): >>throwa+k24

>>surfin+wd
Legally, copyright is only concerned with the specific end work. A unique or not so unique standalone object that is being scrutinized, if this analogy helps.

The process involved in obtaining that end work is completely irrelevant to any copyright case. It can be a claim against the model's weights (not possible as it's fair use), or it's against the specific one-off output end work (less clear), but it can't be looked at as a whole.

replies(1): >>dgolds+sh

>>jeffre+lf
It’s not done with the intent to infringe copyright.

replies(1): >>binket+ah

>>mwigda+(OP)
Copyright != an NDA. Copyright is not an agreement between two entities, but a US federal law, with international obligations both ratified and not.

Copyright has fair use clauses, endless court decisions limiting its use, carve-outs for libraries, additional junk like the DMCA and more slapped on top. It's a patchwork of dozens of treaties and laws, spanning hundreds of years.

For example, you can read a book to a room full of kids, you can use copyright materials in comedic skits, you can quote snippets, the list goes on. And again, this is all legislated.

The point? It's complex, and specific usage of copyrighted works infringing or not, can be debatable without intent immediately being malign.

Meanwhile, an NDA covers far, far more than copyright. It may cover discussion and disclosure of everything or anything, including even client lists, trade secrets, work processes, and more. It is signed, and agreed to by both parties involved. Equating "copyright law" to "an NDA" is a non-starter. There's literally zero legal parallel or comparison here.

And as others have mentioned, the intent of the act would be malicious on top of all of this.

I know a lot of people dislike the whole data snag by OpenAI, and have moral or ethical objections to closed models, but thinking anyone would care about this argument if you breach an NDA is a bad idea. No judge would even remotely accept or listen to such chicanery.

>>throwa+gf
It likely comes from the saying similar to this one: "kill a few, you are a murderer. Kill millions, you are a conqueror".

More generally, we tend to view the number of casualties in war as one large number, and not as the sum of all the tragedies it represents, which we do perceive when fewer people die.

>>rcbdev+sc
You, the user, are inputting variables into their probability algorithm that's resulting in the copyright work. It's just a tool.

replies(3): >>DaSHac+Hh >>maeil+ui >>rcbdev+oo4

>>kmeist+47
I think with some AI you could reproduce artworks of obscure indie artists who are working right now.

If you were a director at a game company and needed art in that style, it would be cheaper to have the AI do it instead of buying from the artist.

I think this is currently an open question.

replies(1): >>dgolds+8k

>>andyjo+Cd
It is a classic geek fallacy to think you can hack the law with logic tricks.

replies(1): >>andyjo+lh

>>mwigda+(OP)
I’m going to break rank from everyone else and explicitly say “not clever”. Developers who think that they know how the legal system works are a dime a dozen. It’s both easy and useless to take some acquired-in-passing, largely incorrect, surface-level understanding of a legal mechanic and “pwned with facts and logic!” in whichever way benefits you.

>>elicks+R1
Please just stop. It’s highly unlikely that any relevant part of any reasonably structured NDA has any material relevance to copyright. Why do developers think that they can just intuit this stuff? This is one step away from being a more trendy “stick the constitution to the back of my car in lieu of a license plate” lunacy.

replies(1): >>elicks+JC1

>>throwa+1f
Yup, that's fair.

As my not-legally-trained interpretation of the rules leads to me being confused about how traditional search engines aren't a copyright violation, I don't trust my own beliefs about the law.

>>throwa+Ge
Some of them are suing

https://www.nytimes.com/2023/12/27/business/media/new-york-t... https://www.reuters.com/legal/us-newspapers-sue-openai-copyr... https://www.washingtonpost.com/technology/2024/04/09/openai-...

Some decided to make deals instead

>>calvin+Ie
You mean unregulated hotels and on-demand taxis?

Uber is no longer subsidized (or even cheap) in most places, it's just an app for summoning taxis and overpriced snacks. AirBnB is underregulated housing for nomads at this point.

Your examples sorta prove the point - they didn't succeed in what they aimed at doing, so they pivoted until the law permitted it.

>>kdnvk+Yf
It would appear that it explicitly IS done with this intent. We are told that an LLM is a living being that merely learns and then creates, and yet we are aware that its outputs regurgitate combinations of its inputs.

>>bazoom+tg
Indeed it is. Obligatory xkcd - https://xkcd.com/1494/

>>jeffre+lf
> done with the intent to not pay the original authors

no one building this software wants to “steal from creators” and the legal precedent for using copyrighted works for the purpose of training is clear with the NYT case against OpenAI.

It’s why things like the recent deal with Reddit to train on their data (which Reddit owns and users give up when using the platform) are becoming so important, same with Twitter/X

replies(1): >>kaoD+yj

>>KoolKa+Jf
I don't think that's accurate. The US Copyright Office last year issued guidance that basically said anything generated with AI can't be copyrighted, as human authorship/creation is required for copyright. Works can incorporate AI-generated content but then those parts aren't covered by copyright.

https://www.federalregister.gov/documents/2023/03/16/2023-05...

So I think the law, at least as currently interpreted, does care about the process.

Though maybe you meant as to whether a new work infringes existing copyright? As this guidance is clearly about new copyright.

replies(2): >>KoolKa+ti >>arrows+Zk

>>KoolKa+bg
How is it any different than training a model on content protected under an NDA and allowing access to users via a web-portal?

What is the difference OpenAI has that lets them get away with it, but not our hypothetical Mr. Smartass doing the same process trying to get around an NDA?

replies(1): >>KoolKa+5k

>>jeffre+lf
Chutzpah. And that the companies doing it are multi-billion dollar companies who can afford the finest legal representation money can buy.

Whether the brazenness with which they are doing this will work out for them is currently playing out in the courts.

>>dgolds+sh
These are two sides of the same coin, and what I'm saying still stands. This is talking about who you attribute authorship to when copyrighting a specific work. Basically on the application form, the author must be a human. The reason it's worth them clarifying is because they've received applications that attributed AIs, and legal persons do exist that aren't human (such as companies); they're just making it clear it has to be human.

Who created the work, it's the user who instructed the AI (it's a tool), you can't attribute it to the AI. It would be the equivalent of Photoshop being attributed as co-author on your work.

>>KoolKa+bg
Let's say a torrent website asks the user through an LLM interface what kind of copyrighted content they want to download and then offers them links based on that, and makes money off of it.

The user is "inputting variables into their probability algorithm that's resulting in the copyright work".

replies(1): >>KoolKa+Bk

>>KoolKa+58
Is converting an audio signal into the frequency domain, pruning all inaudible frequencies, and then Huffman encoding it transformative?

replies(1): >>KoolKa+Bj

>>repeek+ph
> no one building this software wants to “steal from creators”

> It’s why things like the recent deal[s ...] are becoming so important

Sorry but I don't follow. Is it one or the other?

If they didn't want to steal from the original authors, why do they not-steal Reddit now? What happens with the smaller creators that are not Reddit? When is OpenAI meeting with me to discuss compensation?

To me your post felt something like "I'm not robbing you, Small State Without Defense that I just invaded, I just want to have your petroleum, but I'm paying Big State for theirs cause they can kick my ass".

Aren't the recent deals actually implying that everything so far has actually been done with the intent of not compensating their source data creators? If that was not the case, they wouldn't need any deals now, they'd just continue happily doing whatever they've been doing which is oh so clearly lawful.

What did I miss?

replies(1): >>repeek+0q

>>kibibu+Bi
Well if the end result is something completely different such as an algorithm for determining which music is popular or determining which song is playing then yes it's transformative.

It's not merely a compressed version of a song intended to be used in the same way as the original copyright work, this would be copyright infringement.

>>DaSHac+Hh
Well if OpenAI signed an NDA beforehand to not disclose certain training data it used, and then users actually do access this data, then yes it would be problematic for OpenAI, under the terms of their signed NDA.

>>Gravit+jg
I recently read an article that I annoyingly can't find again about an art director at a company that decided to hire some prompters. They got some art, told them to completely change it, got other art, told them to make smaller changes... And then got nothing useful, as the prompters couldn't tell the AI "like that but make this change". AI art may get there in a few years or maybe a decade or two, but it's not there yet. (End of that article: they fired the prompters after a few days.)

An AI-enhanced Photoshop, however, could do wonders, as the base capabilities seem to be mostly there. Haven't used any of the newer AI stuff myself, but https://www.shruggingface.com/blog/how-i-used-stable-diffusi... makes it pretty clear the building blocks seem largely there. So my guess is the main disconnect is in making the machines understand natural language instructions for how to change the art.

>>maeil+ui
Theoretically a torrent website that does not distribute the copyright files themselves in any way should be legal, unless there's a specific law for this (I'm unaware of any, but I may be wrong).

They tend to try to argue for conspiracy to commit copyright infringement; it's a tenuous case to make unless they can prove that was actually their intention. I think in most cases it's ISP/hosting terms and conditions and legal costs that lead to their demise.

Your example of the model asking specifically "what copyrighted content would you like to download", kinda implies conspiracy to commit copyright infringement would be a valid charge.

>>dgolds+sh
Couldn't you just generate it with AI then say you wrote it? How could anyone prove you wrong?

replies(1): >>KoolKa+Vl

>>arrows+Zk
That's what you're supposed to do. No need to hide it either :).

>>TeMPOr+3e
This is a reductionist take. Maybe it's not illegal per se where you live, but it always has ramifications, and these ramifications affect your future a whole lot.

>>kaoD+yj
The law is slow and is always playing catch up in terms of prosecution, it’s not clear today because this kind of copyright has never been an issue before. Usually it’s just outright stealing content that was protected, no one ever imagined “training” to be a protected use case, humans “train” on copyrighted works all the time, ideally copyrighted works they purchased for said purpose… the same will start to apply for AI, you have to have rights to the data for that purpose, hence these deals getting made. In the meantime it’s ask for forgiveness not permission, and companies like Google (less openAI) are ready to go with data governance that lets them remove copyright requested data and keep the rest of the model working fine

Let’s also be clear that making deals with Reddit isn’t stealing from creators. It’s not a platform where you own what you type in; same on here, this is all public domain with no assumed rights to the text. If you write a book and OpenAI trains on it and starts telling it to kids at bedtime, you 100% will have a legal claim in the future, but the companies already have protections in place to prevent exactly that. For example, if you own your website you can request that the data not be crawled. Ultimately, though, if your text is publicly available anyone is allowed to read it, and whether anyone is allowed to train AI on it is an open question that companies are trying to get ahead on.
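To make the "request the data not be crawled" part concrete, here is a minimal sketch using a robots.txt file and Python's standard-library parser. The site URL and the second bot name are made up for illustration; GPTBot is the user agent OpenAI documents for its crawler. Note that robots.txt is only a request: it tells well-behaved crawlers what the owner wants and does nothing to stop one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site whose owner wants to opt out of
# one AI crawler while leaving the site open to everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# example.com and "SomeOtherBot" are placeholders, not real targets.
page = "https://example.com/my-book.html"
gptbot_allowed = parser.can_fetch("GPTBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

A crawler that honors the file would skip the page when its own user agent is disallowed; whether training on pages it did fetch is permitted is the separate, open question.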

replies(1): >>kaoD+Iq

>>repeek+0q
That seems even worse: they had intent to steal and now they're trying to make sure it is properly legislated so nobody else can do it, thus reducing competition.

GPT can't get retroactively untrained on stolen data.

replies(1): >>repeek+5s

>>kaoD+Iq
Google actually can “untrain” afaik. My limited understanding is that they have good controls over their data and its sources, because they know it could be important in the future. GPT, not sure.

I’m not sure what you mean by “steal”, because it’s a relative term now. Me reading your book isn’t stealing if I paid for it and it inspires me to write my own novel with a totally new story. And if you posted your book online, as of right now the legal precedent is you didn’t make any claims to it (anyone could read it for free), so that’s fair game to train on, just like the text I’m writing now also has no protections.

Nearly all Reddit history up to a certain date is available for download online now; only after they changed their policies did they start having tighter controls on how their data could be used.

>>visarg+S3
At first blush, this sounds like a good idea. Thinking deeper, the company is so small that it will be easy to identify the author.

>>cqqxo4+Mg
Actually, I’m a licensed attorney having some fun exploring tongue-in-cheek legal arguments on the internet.

But, I could also be a dog.

>>Sharli+0e
So the fact that it's a lossy compression algorithm makes it ok?

replies(1): >>ben_w+t43

>>snovv_+0c2
"It's lossy" is in isolation much too vague to say if it's OK or not.

A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit is obviously fine.

So, for example, what the NYT is suing over is that it (or so it is claimed) allows the model to regenerate entire articles, which is not OK.

But to claim that it is copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:

> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.

… which is just under 1200 bits, is an unlawful thing to post (and for the purpose of the hypothetical, imagine that quotation in the form of a zero-context tweet rather than the actual fact of this being a case of fair use because of its appearance in a discussion about copyright infringement of novels).
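For anyone who wants to check the "just under 1200 bits" figure, a quick sketch of the arithmetic, assuming the summary is stored as plain ASCII at 8 bits per character (a different encoding or a real compressor would give a different number):

```python
# The one-sentence summary quoted above, copied verbatim.
summary = ("Harry Potter discovers he is a wizard and attends Hogwarts, "
           "where he battles dark forces, including the evil Voldemort, "
           "to save the wizarding world.")

# Assumption: 8 bits per character, i.e. uncompressed ASCII.
bits = 8 * len(summary.encode("ascii"))

print(len(summary), "characters ->", bits, "bits")
assert bits < 1200  # consistent with "just under 1200 bits"
```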

I think anyone who suggests suing over this to a lawyer would discover that lawyers can in fact laugh.

Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities. Some people accuse OpenAI et al. of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better description of "copyright laundering" than the specific things they're formally accused of in the lawsuits.

>>throwa+Ge
There are some lawsuits, especially in the very reflexively copyright-pilled industries. However, a good chunk of publishers aren't suing, and for self-interested reasons. There are a lot of people in the creative industry who see a machine that can cut artists out of the copyright bargain completely and are shouting "omg piracy is based now" because LLMs can spit out content faster and for free.

>>surfin+rf

    > nobody could see what was inside CDOs

Absolutely not true. Where did you get that idea? When pricing the bonds from a CDO you get to see the initial collateral. As a bond owner, you receive monthly updates about any portfolio changes. Weirdly, CDOs frequently have more collateral transparency compared to commercial or residential mortgage deals.

>>KoolKa+bg
Yes, a tool that they charge me money to use.

replies(1): >>KoolKa+fE8

>>rcbdev+oo4
Just like any other tool that can be used to plagiarize: Photoshop, Word, etc.