In the long term, if no one is given credit for their research, creators will either start to wall off their content or stop creating altogether. Both outcomes would be sad.
A humane attribution comment from the AI could go a long way: "I think I read something about <topic X> in the NYTimes <link> on January 3rd, 2021."
Without attribution, it seems, nothing moves forward in the long term: AI loses access to humanity's latest findings, and so does the public.
Not all models will be capable of citing a source, and for plenty of them a citation wouldn't even make sense.
E.g., suppose I train a regression that guesses how many words a book will contain. Which book do I cite when I run an inference? All of them?
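To make the point concrete, here is a minimal sketch (all figures invented): an ordinary least-squares fit blends every training book into a few shared coefficients, so no single title is attributable to any one prediction.

    import numpy as np

    # Hypothetical training set: one row per book,
    # features = [page_count, avg_words_per_page].
    X = np.array([
        [180, 250],   # a short novel
        [420, 300],   # a long novel
        [96,  180],   # a novella
        [700, 320],   # an epic
    ], dtype=float)
    y = np.array([45_000, 126_000, 17_280, 224_000], dtype=float)  # word counts

    # Least squares: the coefficients are a blend of ALL four books.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Inference on an unseen book. Which training book gets the citation?
    new_book = np.array([310, 275], dtype=float)
    print(f"predicted word count: {new_book @ coef:,.0f}")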
For complex subjects, I'm sure the citation page would be large, and a count could be displayed to show the depth of the subject [3].
This is how Google presented search results in the early days [1]: ordered from most to least relevant, with a count of all possible results [2]. The same approach should be attempted for citations; see the sketch after the references below.
[1] http://web.archive.org/web/20120608192927/http://www.google....
[2] https://steemit.com/online/@jaroli/how-google-search-result-...
[3] https://www.smashingmagazine.com/2009/09/search-results-desi...
[4] Next page
:)
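For what it's worth, the presentation itself is the easy part. Here is a toy sketch of a Google-style citation list; the sources and relevance scores are invented, and producing them is the actual hard problem.

    # Hypothetical (source, relevance) pairs, produced upstream somehow.
    citations = [
        ("Some blog post", 0.31),
        ("NYTimes, 2021-01-03", 0.92),
        ("Wikipedia: <topic X>", 0.74),
    ]

    # Google-style: total count up front, then most to least relevant.
    print(f"About {len(citations):,} sources")
    for source, score in sorted(citations, key=lambda c: -c[1]):
        print(f"  [{score:.2f}] {source}")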
To get a feel for the complexity of an LLM, consider that these models typically hold roughly 10,000 times fewer parameters than there are characters in their training data. If you instruct an LLM to search the web and find relevant citations, it may obey, but those citations will not be the actual source of the opinions it formed in order to produce its output.
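Back-of-the-envelope arithmetic behind that ratio (the figures below are assumptions picked to match the 10,000:1 claim above; real models vary widely):

    params = 1e9          # 1B parameters (assumed)
    train_chars = 1e13    # 10T characters of training text (assumed)

    # ~10,000 characters of training text per parameter: at that level of
    # compression, no parameter can point back at the documents that shaped
    # it, so post-hoc web citations are not true provenance.
    print(f"characters per parameter: {train_chars / params:,.0f}")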
Yeah, good luck embedding citations into that. Everyone here saying it's easy should go earn their seven-figure comp at an AI company instead of wasting their time educating us dummies.