I'm not sure how we should treat LLMs with respect to publicly accessible but copyrighted material, but it seems clear to me that "profiting" from copyrighted material isn't a sufficient criterion to make me "owe something to the owner".
A computer isn't a human, and we already have laws whose effect differs depending on whether it's a computer or a human doing the thing. LLMs are no different, no matter how catchy it may be to hype them up as being == humans.
My understanding is that GPT is essentially a word-probability lookup table built from a review of the training material. A statistical analysis of the NYT is not copying.
And this doesn't even get into whether fair use might apply. Since tabulating word frequencies isn't copying, GPT isn't violating anyone's copyright.
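To make that mental model concrete, here's a toy sketch of what "tabulating word frequencies" looks like: a bigram table that stores which words tend to follow which. To be clear, this is my own illustrative oversimplification, not how GPT is actually built (GPT is a neural network, not a literal lookup table); the point is only that such a table stores counts and probabilities, not the source text itself.

```python
# Toy illustration of the "word probability lookup table" mental model.
# NOTE: a deliberate oversimplification -- GPT is a neural network, not a
# literal bigram table -- but it shows what "tabulating word frequencies"
# means: the table holds counts derived from the text, not the text itself.
from collections import Counter, defaultdict

def build_bigram_table(text: str) -> dict[str, Counter]:
    """Map each word to a Counter of the words that follow it."""
    words = text.lower().split()
    table: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        table[current][nxt] += 1
    return table

def next_word_probabilities(table: dict[str, Counter], word: str) -> dict[str, float]:
    """Turn the raw follow-counts for `word` into probabilities."""
    counts = table.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept"
    table = build_bigram_table(corpus)
    print(next_word_probabilities(table, "the"))
    # {'cat': 0.666..., 'mat': 0.333...}
```

Whether the analogy holds for an actual transformer is exactly what's in dispute, of course; the sketch only shows what the "statistical analysis" framing claims is happening.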
If you want to assert that, when it comes to "profit", groups of people who build and operate LLMs should operate under a different set of laws and regulations than individuals who read books in the library, I'm open to that idea. But that is not at all the same as "anthropomorphizing these AI black boxes".
Furthermore, if we manage to "untrain" AI on certain pieces of content, then copyright would really become "brain" damage too. Like, the perceptrons and stuff.
It seems obvious to me that, despite what current law says, there is something not right about what large companies are doing when they create LLMs.
If they are going to build off of humanity's collective work, their product should benefit all of humanity, and not just shareholders.
Now there are (or very, very soon there will be) two kinds of members in the set of things that can learn from reading. How do we properly define the rules for members of that set?
If something can learn from reading, do we ban it from reading copyrighted material, even if it can memorize some of it? Clearly a ban of that form would be a failure for humans. Should we have that ban for all things that can learn?
There is a reasonable argument that if you want things to learn, they have to learn from a wide variety of material, including our best works (which are often copyrighted).
And the statements above imply nothing about whether that access should be free of cost (or not); just that I think blocking "learning programs / LLMs" from being able to access, learn from, or reproduce copyrighted text is a net loss for society.
which laws?
we generally accept computers as agents of their owners.
for example, a law that applies to a human travel agent also applies to a computerized travel agency service.
Take a college student who scans all her textbooks, relying on fair use. If she is the only user, is she obligated to pay a premium for the right to mine them? What about the scenario in which she sells that engine to other owners of the books? What if they only owned the books for a short time, back in school?