The New York Times is suing OpenAI and Microsoft for copyright infringement

>>ssgodd+(OP)
I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and still the NYT is going to get eaten.

>>solard+Aj
Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.

>>aantix+1l
A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

>>apante+6m
> There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.

>>dlandi+At
Neural nets don't memorize passages of text. They train on vectorized tokens. You get a model of how language statistically works, not understanding and memory.

>>Tao330+cB
The model weights clearly encode certain full passages of text; otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like: the most likely token after "call" is "me"; the most likely token after "call me" is "Ishmael". It's ultimately a kind of lossy statistical compression scheme at some level.
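
To make the "most likely next token" idea concrete, here is a toy sketch in Python. It is only a lookup table of next-token counts, not how an actual LLM is built or trained, and the names (`train`, `generate`, `TOKENS`, `COUNTS`) are made up for illustration. It shows how a purely statistical predictor can still hand back its training text word for word when each context has only ever been seen with one continuation.

```python
from collections import Counter, defaultdict

def train(tokens, n=2):
    """Count which token follows each context of 1..n preceding tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens)):
        for k in range(1, n + 1):
            if i - k >= 0:
                context = tuple(tokens[i - k:i])
                counts[context][tokens[i]] += 1
    return counts

def generate(counts, prompt, max_new=10, n=2):
    """Greedily append the most frequent next token, backing off to shorter contexts."""
    out = list(prompt)
    for _ in range(max_new):
        for k in range(n, 0, -1):
            context = tuple(out[-k:])
            if len(context) == k and context in counts:
                out.append(counts[context].most_common(1)[0][0])
                break
        else:
            # No known context matched, so stop generating.
            break
    return out

# Tiny "training set": the opening words of Moby-Dick, lowercased, punctuation dropped.
TOKENS = "call me ishmael some years ago never mind how long precisely".split()
COUNTS = train(TOKENS)

print(" ".join(generate(COUNTS, ["call"], max_new=2)))
print(" ".join(generate(COUNTS, ["call"], max_new=50)))
```

The table stores only counts, never the sentence itself, yet prompting with "call" walks straight back through the original text. A real model generalizes far more than this and is trained on vastly more data, so take it as an intuition pump for the memorization point rather than a claim about any particular system.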

>>tsimio+AN
> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?

>>photon+lQ
Is it still compression if I read Tolkien and reference similar or exact concepts when writing my own works?

Having a magical ring in my book after I've read The Lord of the Rings, is that copyright infringement?
