zlacker

> And the metadata (metaknowledge?) would be larger than the knowledge itself.

Because URLs are usually as long as the writing they point at?

replies(1): >>ahepp+L1

>>photon+(OP)
I’m not an expert in AI training, but I don’t think it’s as simple as storing writing. It does seem to be possible to get the system to regurgitate training material verbatim in some cases, but my understanding is that the text is generated probabilistically.

It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.

Which is to say nothing about whether that challenge is worth taking on.

replies(2): >>tsimio+65 >>photon+r6

>>ahepp+L1
Conceptually, it wouldn't be very hard to take the candidate output and run it through a text matching phase to see if there are ~exact matches in the training corpus, and generate other output if there are (probably limited to the parts of the training corpus where rights couldn't be obtained normally). Of course, it would be quite compute heavy, so it would add significantly to the cost per query.
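
To make the idea concrete, here is a minimal sketch of such a matching phase, assuming the restricted part of the corpus fits in an in-memory set of word n-grams. The function names and the window size `n` are hypothetical and chosen for illustration; a production system would need a far more compact index.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(corpus: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram from the documents whose rights are unclear."""
    index: Set[NGram] = set()
    for doc in corpus:
        index |= word_ngrams(doc, n)
    return index


def has_near_exact_match(candidate: str, index: Set[NGram], n: int = 8) -> bool:
    """True if any n-gram of the candidate output also occurs in the index."""
    return not word_ngrams(candidate, n).isdisjoint(index)


if __name__ == "__main__":
    index = build_index(["the quick brown fox jumps over the lazy dog"], n=5)
    # A candidate that reuses five consecutive words is flagged for regeneration.
    print(has_near_exact_match("he said the quick brown fox jumps again", index, n=5))
```

The caller would regenerate the output whenever `has_near_exact_match` returns true. Note that `n` must be the same when building the index and when querying it.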

replies(1): >>TheCor+K9

>>ahepp+L1
Sure, it's a hard problem, but as others have pointed out frequently in this thread, there is not only "no incentive" to solve it but a clear disincentive. If one can say where the data comes from, one might have to prove that it was used only with permission. And the reason it's a hard problem has nothing to do with metadata volume being greater than content volume. Clearly a book's title and publication year are usually shorter than the book's contents.

>>tsimio+65
GitHub Copilot supports that:

https://docs.github.com/en/copilot/configuring-github-copilo...

Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.

replies(1): >>edwint+wR1

>>TheCor+K9
It is questionable whether that filtering mechanism works; see the previous discussion: >>33226515

But even if it did, an exact-match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does the renaming).

For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.
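
As a rough illustration of what a canonical representation could look like, here is a sketch that uses Python's standard `ast` module to rename every identifier to a positional placeholder before comparing. The class and function names are hypothetical, and this only handles renaming in Python source; it says nothing about how to run such a comparison at corpus scale.

```python
import ast


class Canonicalizer(ast.NodeTransformer):
    """Rename identifiers to placeholders in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # Reuse the placeholder if the name was seen before, else mint a new one.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)  # also rename arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        # Caveat: this renames builtins too, so two programs that differ
        # only in which builtin they call would compare as equal.
        node.id = self._canon(node.id)
        return node


def canonical_form(source):
    """Return a string dump of the AST with all identifiers normalized."""
    tree = Canonicalizer().visit(ast.parse(source))
    return ast.dump(tree)


if __name__ == "__main__":
    original = "def add(x, y):\n    total = x + y\n    return total\n"
    renamed = "def plus(first, second):\n    s = first + second\n    return s\n"
    print(canonical_form(original) == canonical_form(renamed))
```

Comparing the dumps for equality catches pure renames; detecting looser similarity (reordered statements, inlined helpers) would need a fuzzier comparison than this sketch provides.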

Wouldn't it be better to either:
* not include copyrighted content in the training material in the first place, or
* explicitly tag the training material with license and origin information, so that the final output can come with a proof of which training material was relevant to producing it, and not mix differently licensed content?
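
The tagging half of that proposal can be sketched in a few lines, assuming each training document carries an origin and a license identifier. The `TaggedDocument` type, its field names, and `partition_by_license` are hypothetical; the sketch does not attempt the harder part, which is tracing a given output back to the documents that influenced it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TaggedDocument:
    text: str
    origin: str   # assumed: a URL or repository identifier
    license: str  # assumed: an SPDX-style identifier such as "MIT"


def partition_by_license(docs: Iterable[TaggedDocument]) -> Dict[str, List[TaggedDocument]]:
    """Group documents so differently licensed material is never mixed."""
    groups: Dict[str, List[TaggedDocument]] = defaultdict(list)
    for doc in docs:
        groups[doc.license].append(doc)
    return dict(groups)


if __name__ == "__main__":
    corpus = [
        TaggedDocument("def f(): pass", "https://example.com/repo-a", "MIT"),
        TaggedDocument("def g(): pass", "https://example.com/repo-b", "GPL-3.0-only"),
        TaggedDocument("def h(): pass", "https://example.com/repo-c", "MIT"),
    ]
    for license_id, docs in partition_by_license(corpus).items():
        print(license_id, [d.origin for d in docs])
```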