zlacker

A human can't credit the source of each element of everything they've learnt. AI's can't either, and for the same reason.

The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.

And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online; including the structure of grammar, rules of logic, the meaning of words, how they relate to one another. You simply couldn't cite it all.

replies(3): >>photon+g4 >>anigbr+li >>ahepp+co

>>FredPr+(OP)
> And the metadata (metaknowledge?) would be larger than the knowledge itself.

Because URLs are usually as long as the writing they point at?

replies(1): >>ahepp+16

>>photon+g4
I’m not an expert in AI training, but I don’t think it’s as simple as storing writing. It does seem to be possible to get the system to regurgitate training material verbatim in some cases, but my understanding is that the text is generated probabilistically.

It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.

Which is to say nothing about whether that challenge is worth taking on.

replies(2): >>tsimio+m9 >>photon+Ha

>>ahepp+16
Conceptually, it wouldn't be very hard to take the candidate output and run it through a text matching phase to see if there are ~exact matches in the training corpus, and generate other output if there are (probably limited to the parts of the training corpus where rights couldn't be obtained normally). Of course, it would be quite compute heavy, so it would add significantly to the cost per query.

replies(1): >>TheCor+0e

>>ahepp+16
Sure, it's a hard problem, but as others have pointed out frequently in this thread.. there is not only "no incentive" to solve it but a clear disincentive. If one can say where the data comes from, one might have to prove that it was used only with permission. And the reason why it's a hard problem is not related to metadata volume being greater than content volume. Clearly a book title/year published is usually shorter than book contents.

>>tsimio+m9
GitHub Copilot supports that:

https://docs.github.com/en/copilot/configuring-github-copilo...

Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.

replies(1): >>edwint+MV1

>>FredPr+(OP)
Of course not, but you can cite where specific facts or theories were first published. Now, I don't think that not doing so infringes any copyright interest or that doing so creates any liability, any more than if I cited to a scientific paper or public statement of opinion by someone else.

>>FredPr+(OP)
At the same time, there are situations where humans are expected to provide sources for their claims. If you talk about an event in the news, it would be normal for me to ask where you heard about it. 100% accuracy in providing a source wouldn’t be expected, but if you told me you had no idea, or told me something obviously nonsense, I would probably take what you said less seriously.

replies(1): >>fennec+883

>>TheCor+0e
It is questionable whether that filtering mechanism works, previous discussion: >>33226515

But even if it did an exact match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does that).

For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.

Wouldn't it be better to: * Either not include copyrighted content in the training material in the first place * Explicitly tag the training material with license and origin infornation, such that the final output can produce a proof of what training material was relevant for producing that output and don't mix differently licensed content.

>>ahepp+co
The raw technology behind it literally cannot do that.

The model is fuzzy, it's the learning part, it'll never follow the rules to the letter the same as humans fuck up all the time.

But a model trained to be literate and parse meaning could be provided with the hard data via a vector DB or similar, it can cite sources from there or as it finds them via the internet and tbf this is how they should've trained the model.

But in order to become literate, it needs to read...and us humans reuse phrases etc we've picked up all the time "as easy as pie" oops, copyright.

replies(1): >>ahepp+ba4

>>fennec+883
I agree that the model being fuzzy is key aspect of an LLM. It doesn't sound like we're just talking about re-using phrases though. "Simple as pie" is not under copyright. We're talking about the "knowledge" that the model has obtained and in some cases spits out verbatim without attribution.

I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.

replies(1): >>fennec+Cl4

>>ahepp+ba4
Sure, it definitely spits out facts, often not hallucinating. And it can reiterate titles and small chunks of copyright text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim that it makes it clear that it was trained on the original text first hand basis, rather than second hand from other sources.