zlacker

[return to "OpenAI departures: Why can’t former employees talk?"]
1. mwigda+OQ[view] [source] 2024-05-18 04:13:00
>>fnbr+(OP)
The best approach to circumventing the nondisclosure agreement is for the affected employees to get together, write out everything they want to say about OpenAI, train an LLM on that text, and then release it.

Based on these companies' arguments that copyrighted material is not actually reproduced by these models, and that any seemingly-infringing use is the responsibility of the user of the model rather than those who produced it, anyone could freely generate an infinite number of high-truthiness OpenAI anecdotes, freshly laundered by the inference engine, that couldn't be used against the original authors without OpenAI undermining the legal stance it takes on its own models.

◧◩
2. TeMPOr+0T[view] [source] 2024-05-18 04:55:59
>>mwigda+OQ
Clever, but no.

The argument that LLMs aren't copyright laundromats hinges on the scale and non-specificity of training. There's a difference between "LLM reproduced this piece of copyrighted work because it memorized it from being fed literally half the internet" and "LLM was intentionally trained to specifically reproduce variants of this particular work". Whatever one's stance on the former case, the latter would be plainly infringing copyright and admitting to it.

In other words: GPT-4 gets to get away with occasionally spitting out something real verbatim. Llama2-7b-finetune-NYTArticles does not.

◧◩◪
3. bluefi+kT[view] [source] 2024-05-18 05:01:51
>>TeMPOr+0T
Seems absurd that the scale being massive somehow makes it better

You would think having a massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water

◧◩◪◨
4. kmeist+SX[view] [source] 2024-05-18 06:20:10
>>bluefi+kT
So, the law has this concept of 'de minimis' infringement, where if you take a very small amount - like, way smaller than even a fair use - the courts don't care. If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low, so courts aren't likely to care.

If you're only training on a handful of works then you're taking more from each of them, meaning it's not de minimis.

For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Sixth Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes without Congressional intervention even!

[0] who is NOT pro-AI, he just thinks labor law is a better bulwark against it than copyright

◧◩◪◨⬒
5. wtalli+xY[view] [source] 2024-05-18 06:29:07
>>kmeist+SX
If your training process ingests the entire text of the book, and trains with a large context size, you're getting more than just "a handful of word probabilities" from that book.
◧◩◪◨⬒⬓
6. ben_w+zZ[view] [source] 2024-05-18 06:46:31
>>wtalli+xY
If you've trained a 16-bit ten billion parameter model on ten trillion tokens, then the mean training token changes 2/125 of a bit, and a 60k word novel (~75k tokens) contributes 1200 bits.
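
For anyone who wants to check that arithmetic, here's a quick back-of-the-envelope sketch in Python; the 1.25 tokens-per-word ratio is my assumption:

    params = 10e9                         # ten billion parameters
    bits_per_param = 16                   # 16-bit weights
    tokens = 10e12                        # ten trillion training tokens

    model_bits = params * bits_per_param  # 1.6e11 bits in the whole model
    bits_per_token = model_bits / tokens  # 0.016, i.e. 2/125 of a bit

    novel_tokens = 60_000 * 1.25          # ~75k tokens for a 60k word novel
    print(bits_per_token)                 # 0.016
    print(novel_tokens * bits_per_token)  # 1200.0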

It's up to you if that counts as "a handful" or not.

◧◩◪◨⬒⬓⬔
7. snovv_+g11[view] [source] 2024-05-18 07:13:43
>>ben_w+zZ
If I invent an amazing lossless compression algorithm such that adding an entire 60k word novel to my blob only increases its size by 1200 bits, does that mean I'm not infringing copyright if I release that model?
◧◩◪◨⬒⬓⬔⧯
8. Sharli+O41[view] [source] 2024-05-18 08:08:33
>>snovv_+g11
How is that relevant? If some LLM were able to regurgitate a 60k word novel verbatim on demand, sure, the copyright situation would be different. But last I checked they can't - not 60k words, not 6k, not even 600. Perhaps they can do 60 words of some well-known passages from the Bible or other similarly ubiquitous copyright-free works.
◧◩◪◨⬒⬓⬔⧯▣
9. snovv_+O23[view] [source] 2024-05-19 07:01:21
>>Sharli+O41
So the fact that it's a lossy compression algorithm makes it ok?
◧◩◪◨⬒⬓⬔⧯▣▦
10. ben_w+hV3[view] [source] 2024-05-19 16:59:41
>>snovv_+O23
"It's lossy" is in isolation much too vague to say if it's OK or not.

A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit is obviously fine.

So, for example, what the NYT is suing over is that the training (or so it is claimed) allows the model to regenerate entire articles, which is not OK.

But to claim that it is a copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:

> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.

… which is just under 1200 bits, is an unlawful thing to post (and for the purpose of the hypothetical, imagine that quotation as a zero-context tweet, rather than what it actually is here: a fair use arising from its appearance in a discussion about copyright infringement of novels).
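
If you want to check the count yourself, a one-off sketch in Python, assuming plain 8-bit ASCII:

    summary = ("Harry Potter discovers he is a wizard and attends Hogwarts, "
               "where he battles dark forces, including the evil Voldemort, "
               "to save the wizarding world.")
    print(len(summary) * 8)  # 1184 bits - just under 1200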

I think anyone who suggests suing over this to a lawyer would discover that lawyers can, in fact, laugh.

Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities; some people accuse OpenAI et al. of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better description of "copyright laundering" than the specific things they're formally accused of in the lawsuits.

[go to top]