The New York Times is suing OpenAI and Microsoft for copyright infringement

>>ssgodd+(OP)
I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and still the NYT is going to get eaten.

>>solard+Aj
Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.

>>aantix+1l
A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

>>apante+6m
> There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.

>>dlandi+At
Neural nets don't memorize passages of text. They train on vectorized tokens. You get a model of how language statistically works, not understanding and memory.

>>Tao330+cB
The model weights clearly encode certain full passages of text; otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like "the most likely token after "call" is "me"; the most likely token after "call me" is "Ishmael"". It's ultimately a kind of lossy statistical compression scheme at some level.
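
To make the "most likely next token" picture concrete, here is a minimal sketch using a toy n-gram counter. It is not how a transformer LLM actually stores anything; the function names `train` and `generate` and the `order` parameter are made up for illustration. It only shows that a table of next-token statistics can end up reproducing its training text verbatim when each context has a single likely continuation.

```python
from collections import Counter, defaultdict

def train(tokens, order=2):
    """Count which token follows each context of up to `order` preceding tokens.
    (Hypothetical helper for illustration, not a real LLM training loop.)"""
    table = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for k in range(1, order + 1):
            if i - k >= 0:
                table[tuple(tokens[i - k:i])][tokens[i]] += 1
    return table

def generate(table, prompt, max_new=20, order=2):
    """Greedy decoding: always append the most frequent next token,
    trying the longest known context first."""
    out = list(prompt)
    for _ in range(max_new):
        for k in range(min(order, len(out)), 0, -1):
            ctx = tuple(out[-k:])
            if ctx in table:
                out.append(table[ctx].most_common(1)[0][0])
                break
        else:
            break  # no known context: stop generating
    return out

text = "Call me Ishmael . Some years ago , never mind how long precisely"
tokens = text.split()
table = train(tokens)

# Starting from the first word alone, greedy decoding walks back
# through the whole training passage.
result = generate(table, ["Call"], max_new=len(tokens) - 1)
print(" ".join(result))
```

The table holds only counts of what follows what, yet the passage comes back out word for word because every context in this tiny corpus is unique. With a large, varied corpus most contexts have many continuations, which is where the "lossy" part comes in.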

>>tsimio+AN
> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?

>>photon+lQ
If you watch a bunch of movies then go on to make your own movie based on influence from these movies, you are protected even if you have mentally compressed them into your own movie. At some point, you can learn, be influenced and be inspired from copyrighted material (not copyright infringement), and at some point you are just making a poor copy of the material (definitely copyright infringement). LLMs are probably still closer to the latter case than the former, but eventually AI will reach the former case.

>>seanmc+SQ
There's no obvious need to hold people / AI to the same standards here, yet, even if compression in mental-models is exactly analogous to compression in machine-models. I guess we decided already that corporations are "like" persons legally, but the jury is still out on AIs. Perhaps people should be allowed more leeway to make possibly-questionable derivative works, because they have lives to live, and genuine if misguided creative urges, and bills to pay, etc. Obviously it's quite difficult to try and answer the exact point at which synthesis & summary cross a line to become "original content". But it seems to me that, if anything, machines should be held to a higher standard than people.

Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.

>>photon+t11
> But it seems to me that, if anything, machines should be held to a higher standard than people.

If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could without having the ability to more easily have that material influence new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).

It will be fun to trip through these questions over the next 20 years.

>>seanmc+Ok1
As long as machines need to leech on human creativity, those humans need to be paid somehow. The human ecosystem works fine thanks to the limitations of humans. A machine that could copy things with abandon, however, could easily disrupt this ecosystem, resulting in fewer new things being created in total; it just leeches without paying anything back, unlike humans.

If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

>>Jensso+Xv1
> If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

I disagree that our own creativity doesn't work that way: nothing is very original, our current art is based on 100k years of building up from when cavemen would scrawl simple art into the stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps, we generalize over several different sources, diluting plagiarism with abstraction?

We are still in the early days of this tech; we will be having very different conversations about it as soon as 5 years from now.
