zlacker

I agree that the model being fuzzy is key aspect of an LLM. It doesn't sound like we're just talking about re-using phrases though. "Simple as pie" is not under copyright. We're talking about the "knowledge" that the model has obtained and in some cases spits out verbatim without attribution.

I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.

replies(1): >>fennec+rb

>>ahepp+(OP)
Sure, it definitely spits out facts, often not hallucinating. And it can reiterate titles and small chunks of copyright text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim that it makes it clear that it was trained on the original text first hand basis, rather than second hand from other sources.