zlacker

1. layer8+(OP) 2023-12-27 17:09:48
It’s likely first and foremost a resource problem. “How different would the output be if that text hadn’t been part of the training data?” can _in principle_ be answered by training N models instead of one, where N is the number of texts in the training data and model i omits text i from its training data, and then, when using the model(s), running all N models in parallel and applying some distance metric to their outputs. In the case of a verbatim quote, at least one of the models will stand out in that comparison, allowing one to infer the source. The difficulty is finding a way to do something along those lines efficiently enough to be practical.
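A toy sketch of that leave-one-out idea, with a character-bigram counter standing in for an actual LLM (names like train_bigram_model and score are illustrative, not a real API); the quote is attributed to whichever leave-one-out model gives it the worst score:

    from collections import Counter
    import math

    def train_bigram_model(texts):
        # Toy "training": character-bigram frequencies over the corpus.
        counts = Counter()
        for t in texts:
            counts.update(zip(t, t[1:]))
        total = sum(counts.values()) or 1
        return {bg: c / total for bg, c in counts.items()}

    def score(model, text):
        # Average log-probability of the text's bigrams under a model,
        # with a small floor for bigrams the model never saw.
        eps = 1e-9
        bigrams = list(zip(text, text[1:]))
        return sum(math.log(model.get(bg, eps)) for bg in bigrams) / max(len(bigrams), 1)

    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "call me ishmael",  # the text the "verbatim quote" comes from
        "it was the best of times it was the worst of times",
    ]

    # N leave-one-out models: model i is trained without text i.
    loo_models = [train_bigram_model(corpus[:i] + corpus[i + 1:])
                  for i in range(len(corpus))]

    # Run all N models in parallel on a verbatim quote; the model that stands
    # out (scores it worst) is the one whose held-out text is the source.
    quote = "call me ishmael"
    scores = [score(m, quote) for m in loo_models]
    source_index = min(range(len(corpus)), key=lambda i: scores[i])
    print("inferred source:", corpus[source_index])

With a real LLM, each entry in loo_models would be a separate full training run, which is exactly the resource problem described above.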
replies(1): >>spunke+v2
2. spunke+v2 2023-12-27 17:24:33
>>layer8+(OP)
Each LLM costs ($10-100) million to train, times billions of training texts ~= $100 quadrillion dollars, so that is unfortunately out of reach of most countries.
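The back-of-envelope version of that estimate, assuming the upper end of both ranges:

    cost_per_run = 100e6   # ~$100 million per full training run
    num_texts = 1e9        # ~1 billion training texts -> 1 billion leave-one-out runs
    print(f"${cost_per_run * num_texts:,.0f}")  # $100,000,000,000,000,000 = $100 quadrillion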