Honestly, I get this feeling about these lawsuits about using content to train LLMs.
Think of it this way: growing up, learning to read, and getting an education, you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, vehicles, furniture, etc., many of which might have design patents. We have such silliness as it being illegal to commercially distribute photos of the Eiffel Tower at night without prior authorization [2].
What's the difference between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?
Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech. After all, no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?
[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war
[2]: https://www.travelandleisure.com/photography/illegal-to-take...
"Photographing the Eiffel Tower at night is not illegal at all. Any individual can take photos and share them on social networks. But the situation is different for professionals. The Eiffel Tower's lighting and sparkling lights are protected by copyright, so professional use of images of the Eiffel Tower at night requires prior authorization and may be subject to a fee."