It's up to you if that counts as "a handful" or not.
If we take math or computer science for example: some very important algorithms can be compressed to a few bits of information if you (or a model) have a thorough understanding of the surrounding theory to go with it. Would it not amount to IP infringement if a model regurgitates the relevant information from a patent application, even if it is represented by under a kilobyte of information?
I think this is all compatible with saying that ingesting an entire book is still a case of:
> If you're taking a handful of word probabilities from every book ever written, then the portion taken from each work is very, very low
(Though I wouldn't want to bet either way on the "so courts aren't likely to care" that follows on from that quote: my not-legally-trained interpretation of the rules leaves me confused about how traditional search engines aren't a copyright violation.)
Because that's the distinction being argued here: it's "a handful"[0] of probabilities, not the complete work.
[0] I'm not sold on the phrasing "a handful", but I don't care enough to argue terminology; the term "handful" feels like it's being used in a sorites paradox kind of way: https://en.wikipedia.org/wiki/Sorites_paradox
As my not-legally-trained interpretation of the rules leads to me being confused about how traditional search engines aren't a copyright violation, I don't trust my own beliefs about the law.
A compression algorithm which loses 1 bit of real data is obviously not going to protect you from copyright infringement claims; something that reduces all inputs to a single bit is obviously fine.
So, for example, what the NYT is suing over is that it (or so it is claimed) allows the model to regenerate entire articles, which is not OK.
But to claim that it is copyright infringement to "compress" a Harry Potter novel to 1200 bits is to say that this:
> Harry Potter discovers he is a wizard and attends Hogwarts, where he battles dark forces, including the evil Voldemort, to save the wizarding world.
… which is just under 1200 bits, is an unlawful thing to post (and for the purpose of the hypothetical, imagine that quotation in the form of a zero-context tweet rather than the actual fact of this being a case of fair-use because of its appearance in a discussion about copyright infringement of novels).
I think anyone who suggested suing over this to a lawyer would discover that lawyers can, in fact, laugh.
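For what it's worth, the "just under 1200 bits" figure is easy to check. A quick sketch, assuming plain uncompressed UTF-8 at 8 bits per byte (the variable names are just mine, and a real compressor would get it smaller still):

```python
# The one-sentence plot summary quoted above, copied verbatim.
summary = (
    "Harry Potter discovers he is a wizard and attends Hogwarts, "
    "where he battles dark forces, including the evil Voldemort, "
    "to save the wizarding world."
)

# Assumption: no compression, 8 bits per byte; the text is pure ASCII,
# so each character is exactly one byte in UTF-8.
n_bytes = len(summary.encode("utf-8"))
n_bits = n_bytes * 8

print(n_bytes, n_bits)
```

That comes out to 148 bytes, i.e. 1184 bits.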
Now, there's also the question of whether it's legal to train a model on all of the Harry Potter fan wikis, which almost certainly have a huge overlap with the contents of the novels and thus strengthen these same probabilities; some people accuse OpenAI et al. of "copyright laundering", and I think ingesting derivative works such as fan sites would be a better description of "copyright laundering" than the specific things they're formally accused of in the lawsuits.