zlacker

[return to "The New York Times is suing OpenAI and Microsoft for copyright infringement"]
1. solard+Aj[view] [source] 2023-12-27 15:53:06
>>ssgodd+(OP)
I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and still the NYT is going to get eaten.
2. aantix+1l[view] [source] 2023-12-27 16:01:23
>>solard+Aj
Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.

3. FredPr+7D[view] [source] 2023-12-27 17:41:30
>>aantix+1l
A human can't credit the source of every element of everything they've learnt. AIs can't either, and for the same reason.

The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.

And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online, including the structure of grammar, the rules of logic, the meaning of words, and how they relate to one another. You simply couldn't cite it all.

4. photon+nH[view] [source] 2023-12-27 18:04:17
>>FredPr+7D
> And the metadata (metaknowledge?) would be larger than the knowledge itself.

Because URLs are usually as long as the writing they point at?

5. ahepp+8J[view] [source] 2023-12-27 18:14:14
>>photon+nH
I’m not an expert in AI training, but I don’t think it’s as simple as storing writing. It does seem to be possible to get the system to regurgitate training material verbatim in some cases, but my understanding is that the text is generated probabilistically.

It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.

Which is to say nothing about whether that challenge is worth taking on.
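To make the "generated probabilistically" point concrete, here's a toy sketch: at each step a language model outputs a probability distribution over possible next tokens, and one token is sampled from it. The vocabulary and probabilities below are invented for illustration; a real model's distribution comes from billions of blended training examples, which is why there's no single source to point back at.

```python
import random

# Toy next-token distribution a model might output at one step
# (vocabulary and probabilities are made up for illustration).
next_token_probs = {"the": 0.5, "a": 0.3, "dog": 0.15, "quantum": 0.05}

def sample_token(probs):
    """Sample one token according to its probability weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

token = sample_token(next_token_probs)
```

Run it twice and you may get different tokens; the output is a draw from a distribution, not a lookup into stored documents, which is what makes per-sentence attribution such an awkward fit.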
