zlacker

[parent] [thread] 13 comments
1. wtalli+(OP)[view] [source] 2024-05-18 06:29:07
If your training process ingests the entire text of the book, and trains with a large context size, you're getting more than just "a handful of word probabilities" from that book.
replies(1): >>ben_w+21
2. ben_w+21[view] [source] 2024-05-18 06:46:31
>>wtalli+(OP)
If you've trained a 16-bit ten billion parameter model on ten trillion tokens, then the mean training token changes 2/125 of a bit, and a 60k word novel (~75k tokens) contributes 1200 bits.

It's up to you if that counts as "a handful" or not.
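The arithmetic checks out; here is a quick sketch (using the comment's hypothetical figures, not measurements of any real model) for anyone who wants to reproduce it:

```python
# Back-of-envelope version of the claim above: hypothetical figures only.
model_bits = 16 * 10**10        # 16-bit weights x 10 billion parameters
training_tokens = 10**13        # ten trillion training tokens

bits_per_token = model_bits / training_tokens   # 0.016 = 2/125 of a bit
novel_tokens = 75_000                           # ~60k words as ~75k tokens
novel_bits = bits_per_token * novel_tokens      # 1200.0 bits for the novel

print(bits_per_token, novel_bits)
```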

replies(4): >>hanswo+d2 >>snovv_+J2 >>andrep+56 >>throwa+i7
3. hanswo+d2[view] [source] [discussion] 2024-05-18 07:05:32
>>ben_w+21
I think it’s questionable whether you can actually use this bit count to represent the amount of information from the book. Those 1200 bits represent the way in which this particular book is different from everything else the model has ingested. Similarly, if you read an entire book yourself, your brain will just store the salient bits, not the entire text, unless you have a photographic memory.

If we take math or computer science for example: some very important algorithms can be compressed to a few bits of information if you (or a model) have a thorough understanding of the surrounding theory to go with it. Would it not amount to IP infringement if a model regurgitates the relevant information from a patent application, even if it is represented by under a kilobyte of information?

replies(1): >>ben_w+v6
4. snovv_+J2[view] [source] [discussion] 2024-05-18 07:13:43
>>ben_w+21
If I invent an amazing lossless compression algorithm such that adding an entire 60k word novel to my blob only increases its size by 1.2 kilobits (150 bytes), does that mean I'm not copyright infringing if I release that model?
replies(1): >>Sharli+h6
5. andrep+56[view] [source] [discussion] 2024-05-18 08:04:09
>>ben_w+21
xz can compress the text of Harry Potter by a factor of 30:1. Does that mean I can also distribute compressed copies of copyrighted works and that's okay?
replies(3): >>ben_w+J6 >>Sharli+R6 >>realus+m7
6. Sharli+h6[view] [source] [discussion] 2024-05-18 08:08:33
>>snovv_+J2
How is that relevant? If some LLM were able to regurgitate a 60k word novel verbatim on demand, sure, the copyright situation would be different. But last I checked they can’t, not 60k words, not 6k, not even 600. Perhaps they can do 60 words of some well-known passages from the Bible or other similarly ubiquitous copyright-free works.
replies(1): >>snovv_+h42
7. ben_w+v6[view] [source] [discussion] 2024-05-18 08:13:51
>>hanswo+d2
I agree with what I think you're saying, so I'm not sure I've understood you.

I think this is all still compatible with the earlier claim, even when an entire book is ingested:

> If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low

(Though I wouldn't want to bet either way on the "so courts aren't likely to care" that follows from that quote: my not-legally-trained interpretation of the rules leaves me confused about how traditional search engines aren't a copyright violation.)

8. ben_w+J6[view] [source] [discussion] 2024-05-18 08:17:15
>>andrep+56
Can you get that book out of an LLM?

Because that's the distinction being argued here: it's "a handful"[0] of probabilities, not the complete work.

[0] I'm not sold on the phrasing "a handful", but I don't care enough to argue terminology; the term "handful" feels like it's being used in a sorites paradox kind of way: https://en.wikipedia.org/wiki/Sorites_paradox

9. Sharli+R6[view] [source] [discussion] 2024-05-18 08:18:46
>>andrep+56
Incredibly poor analogy. If an LLM were able to regurgitate Harry Potter on demand like xz can, the copyright situation would be much more black and white. But they can’t, and it’s not even close.
10. throwa+i7[view] [source] [discussion] 2024-05-18 08:24:50
>>ben_w+21
To be fair, OP raises an important question that I hope smart legal minds are pondering. In my view, they aren't looking for a "programmer answers a legal question" response. A court might well agree with their premise; what the damages or restrictions would be, I can't speculate. Any IP lawyers here who want to share some thoughts?
replies(1): >>ben_w+59
11. realus+m7[view] [source] [discussion] 2024-05-18 08:26:24
>>andrep+56
You can't get Harry Potter out of the LLM, that's the difference
12. ben_w+59[view] [source] [discussion] 2024-05-18 08:46:27
>>throwa+i7
Yup, that's fair.

As my not-legally-trained interpretation of the rules leaves me confused about how traditional search engines aren't a copyright violation, I don't trust my own beliefs about the law.

13. snovv_+h42[view] [source] [discussion] 2024-05-19 07:01:21
>>Sharli+h6
So the fact that it's a lossy compression algorithm makes it ok?
replies(1): >>ben_w+KW2
14. ben_w+KW2[view] [source] [discussion] 2024-05-19 16:59:41
>>snovv_+h42
"It's lossy" is, on its own, much too vague to say whether it's OK or not.

A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; one that reduces all inputs to a single bit is obviously fine.

So, for example, what the NYT is suing over is that the training (or so it is claimed) allows the model to regenerate entire articles, which is not OK.

But to claim that it is copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:

> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.

… which is just under 1200 bits, is an unlawful thing to post (for the purpose of the hypothetical, imagine that quotation as a zero-context tweet, rather than what it actually is here: fair use, since it appears in a discussion about copyright infringement of novels).
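For the record, the bit count holds up: at 8 bits per ASCII character, that one-sentence summary comes to 1184 bits, just under the 1200-bit budget. A quick check:

```python
# The quoted summary, stored as plain 8-bit ASCII: 148 chars -> 1184 bits.
summary = (
    "Harry Potter discovers he is a wizard and attends Hogwarts, where he "
    "battles dark forces, including the evil Voldemort, to save the "
    "wizarding world."
)
print(len(summary) * 8)  # bits needed at 8 bits per character
```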

I think anyone who suggests suing over this to a lawyer would discover that lawyers can in fact laugh.

Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities. Some people accuse OpenAI et al of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better description of "copyright laundering" than the specific things they're formally accused of in the lawsuits.
