zlacker

[parent] [thread] 6 comments
1. 8note+(OP)[view] [source] 2023-12-27 16:32:09
If you're going to consider training AI as fair use, you'll have all kinds of people with different skill levels training AIs that work in different ways on the corpus.

Not all of them will have the capability to cite a source, and for plenty of them, citing a source won't even make sense.

E.g., suppose I train a regression that guesses how many words will be in a book.

Which book do I cite when I do an inference? All of them?
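That scenario fits in a few lines. A minimal sketch (all numbers invented): a least-squares fit over (pages, words) pairs blends every book into two coefficients, so there is no per-book source left to cite at inference time.

```python
# Hypothetical sketch: ordinary least squares trained on a "corpus"
# of (page_count, word_count) pairs. The fitted coefficients mix
# every training book together; no single book maps to a prediction.

def fit_line(xs, ys):
    """Least-squares fit for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Made-up training data: page counts and word counts for four books.
pages = [100, 250, 320, 500]
words = [28000, 70000, 90000, 140000]

a, b = fit_line(pages, words)
estimate = a * 400 + b  # inference for an unseen 400-page book
```

The inference uses only `a` and `b`; the individual books are unrecoverable from those two numbers, which is the point of the question above.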

replies(2): >>aantix+t1 >>benrow+6j
2. aantix+t1[view] [source] 2023-12-27 16:39:59
>>8note+(OP)
Any citation would be a good start.

For complex subjects, I'm sure the citation list would be large, with a count displayed to show the depth of the subject [3].

This is how Google did it with search results in the early days [1]: ranked from most to least relevant, with a count of all possible results [2].

The same approach should be applied to citations.

replies(1): >>jquery+L7
3. jquery+L7[view] [source] [discussion] 2023-12-27 17:14:23
>>aantix+t1
Ok, now please cite the sources of the comment you just made. It's okay if the citation list is large; just list your citations from most probable to least probable.
replies(1): >>aantix+Xa
4. aantix+Xa[view] [source] [discussion] 2023-12-27 17:31:42
>>jquery+L7
"Now displaying 3 citations out of ~150,000,000.."

[1] http://web.archive.org/web/20120608192927/http://www.google....

[2] https://steemit.com/online/@jaroli/how-google-search-result-...

[3] https://www.smashingmagazine.com/2009/09/search-results-desi...

[4] Next page

:)

replies(1): >>pama+3h
5. pama+3h[view] [source] [discussion] 2023-12-27 18:04:44
>>aantix+Xa
This does not answer the GP's question and does not count as a satisfactory ranked citation list. The first citation is particularly dubious. Also, you didn't clarify which statement was based on which citation. I didn't see "dog" in your text.

To help understand the complexity of an LLM, consider that these models typically hold about 10,000 less parameters than there are characters in the training data. If you instruct the LLM to search the web and find relevant citations, it might obey that command, but those citations will not be the source of how it formed the opinions that produced its output.
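The rough arithmetic behind a ratio like that can be sketched as follows (the model and corpus sizes here are invented for illustration, not figures for any particular model):

```python
# Back-of-envelope version of the parameters-to-characters ratio:
# a model with ~1 billion parameters trained on ~10 trillion characters
# holds roughly one parameter per 10,000 characters it saw.
params = 1_000_000_000            # ~1B parameters (assumed)
train_chars = 10_000_000_000_000  # ~10T characters of training text (assumed)
chars_per_param = train_chars / params
```

At that compression ratio, the weights cannot store a per-document record of where each learned pattern came from.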

replies(1): >>jquery+47a
6. benrow+6j[view] [source] 2023-12-27 18:16:21
>>8note+(OP)
Regression is a good analogy for the problem here. If you found a line of best fit for some data points, how would you get back the original data points from the line?

Now imagine terabytes' worth of data points, and thousands of dimensions rather than two.

7. jquery+47a[view] [source] [discussion] 2023-12-31 07:33:06
>>pama+3h
You mean 10,000x fewer parameters? In other words, only 1 parameter for every 10,000 characters of input?

Yeah, good luck embedding citations into that. Everyone here saying it's easy should go earn their 7-figure comp at an AI company instead of wasting their time educating us dummies.

[go to top]