zlacker

[parent] [thread] 25 comments
1. apante+(OP)[view] [source] 2023-12-27 16:07:49
A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.
replies(3): >>aantix+01 >>layer8+V5 >>dlandi+u7
2. aantix+01[view] [source] 2023-12-27 16:14:31
>>apante+(OP)
It's possible. Perplexity.ai is trying to solve this problem.

E.g. "Japan's App Store antitrust case"

https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...

replies(2): >>Philpa+c3 >>simonw+m3
◧◩
3. Philpa+c3[view] [source] [discussion] 2023-12-27 16:26:23
>>aantix+01
That’s not the same thing. Perplexity is using an already-trained LLM to read those sources and synthesise a new result from them. This allows them to cite the sources used for generation.

LLM training sees these documents without context; it doesn’t know where they came from, and any such attribution would become part of the thing it’s trying to mimic.

It’s still largely an unsolved problem.

◧◩
4. simonw+m3[view] [source] [discussion] 2023-12-27 16:27:06
>>aantix+01
That's a different approach: they've implemented RAG, Retrieval Augmented Generation, where the tool runs additional searches as part of answering a question.

ChatGPT Browse, Bing, and Google Bard implement the same pattern.

RAG does allow for some citation, but it doesn't help with the larger problem of not being able to cite for answers provided by the unassisted language model.
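
Roughly, the pattern looks like this (a toy sketch: made-up corpus, naive keyword retrieval just to show the shape, and the actual model call omitted; real systems use a search engine or vector index):

    # Toy sketch of the RAG pattern: retrieve sources first, then build a
    # prompt that asks the model to answer *from* those sources, so the
    # answer can carry citations. Corpus and URLs are invented stand-ins.
    CORPUS = [
        {"url": "https://example.com/apple-japan",
         "text": "Japan's regulator opened an antitrust case over App Store rules..."},
        {"url": "https://example.com/other",
         "text": "An unrelated article about something else entirely..."},
    ]

    def retrieve(question, top_k=2):
        # naive keyword-overlap scoring, purely illustrative
        score = lambda d: sum(w in d["text"].lower() for w in question.lower().split())
        return sorted(CORPUS, key=score, reverse=True)[:top_k]

    def build_prompt(question):
        docs = retrieve(question)
        context = "\n\n".join(f"[{i+1}] {d['url']}\n{d['text']}" for i, d in enumerate(docs))
        return ("Answer using ONLY the numbered sources below, citing them like [1].\n\n"
                f"{context}\n\nQuestion: {question}")

    print(build_prompt("Japan's App Store antitrust case"))

The citations come from the retrieval step, not from the model itself, which is exactly why this doesn't solve attribution for the unassisted model.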

5. layer8+V5[view] [source] 2023-12-27 16:40:28
>>apante+(OP)
Presumably, if a passage of any significant length is quoted verbatim (or almost verbatim), there should be a way to track that source through the weights.

The issue of replicating a style is probably more difficult.
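
To sketch the verbatim case: even without going through the weights at all, a crude corpus-side check could flag near-verbatim reproduction after the fact, assuming you have the training corpus on hand (the model vendor does). Something like:

    # Rough sketch: flag near-verbatim reproduction by looking for long n-gram
    # overlaps between a model's output and the training corpus. This attributes
    # by searching the corpus, not by inspecting the weights; the corpus dict
    # below is a made-up stand-in.
    def ngrams(text, n=8):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_sources(output, corpus, n=8):
        out_grams = ngrams(output, n)
        hits = [(doc_id, len(out_grams & ngrams(doc_text, n)))
                for doc_id, doc_text in corpus.items()]
        return sorted([h for h in hits if h[1] > 0], key=lambda h: -h[1])

    # corpus = {"moby_dick": full_text, "some_blog": full_text, ...}
    # verbatim_sources(model_output, corpus) -> ranked list of likely sources

At real corpus scale you'd want an index (suffix arrays, Bloom filters) rather than a linear scan, but the principle is the same.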

replies(1): >>solvei+u9
6. dlandi+u7[view] [source] 2023-12-27 16:47:17
>>apante+(OP)
> There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

But if it's possible for the neural net to memorize passages of text, then surely it could also memorize where it got those passages from. Perhaps not with today's exact models and technology, but if it were a requirement, someone would figure out a way to do it.

replies(2): >>wrs+oa >>Tao330+6f
◧◩
7. solvei+u9[view] [source] [discussion] 2023-12-27 16:58:38
>>layer8+V5
> Presumably, if a passage of any significant length is quoted verbatim (or almost verbatim), there should be a way to track that source through the weights.

Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.

replies(2): >>layer8+nb >>photon+Jx
◧◩
8. wrs+oa[view] [source] [discussion] 2023-12-27 17:04:01
>>dlandi+u7
Except it doesn’t memorize text. It generates text that is statistically likely. Generating a citation that is statistically likely wouldn’t really help the problem.
replies(1): >>__loam+mw1
◧◩◪
9. layer8+nb[view] [source] [discussion] 2023-12-27 17:09:48
>>solvei+u9
It’s likely first and foremost a resource problem. “How different would the output be if that text hadn’t been part of the training data?” can _in principle_ be answered by training N models instead of one, where N is the number of texts in the training data, omitting text i from the training data of model i, and then, when using the model(s), running all N models in parallel and applying some distance metric to their outputs. In the case of a verbatim quote, at least one of the models will stand out in that comparison, allowing one to infer the source. The difficulty would be in finding a way to do something along those lines efficiently enough to be practical.
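
To make that concrete, the brute-force version would look something like this (purely an in-principle sketch: train() and distance() are hypothetical stand-ins, and as the reply below notes, actually running it at LLM scale is hopeless):

    # Brute-force leave-one-out attribution, as described above: train one model
    # per held-out document, then see whose absence changes the output the most
    # on a given prompt. train() and distance() are hypothetical stand-ins.
    def leave_one_out_attribution(corpus, prompt, train, distance):
        full_model = train(corpus)
        baseline = full_model.generate(prompt)

        scores = {}
        for i in range(len(corpus)):
            held_out = corpus[:i] + corpus[i + 1:]     # drop text i
            model_i = train(held_out)
            scores[i] = distance(baseline, model_i.generate(prompt))

        # The text whose removal changes the output the most is the best
        # candidate source for that output.
        return max(scores, key=scores.get)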
replies(1): >>spunke+Sd
◧◩◪◨
10. spunke+Sd[view] [source] [discussion] 2023-12-27 17:24:33
>>layer8+nb
each LLM costs $10-100 million to train, times billions of training documents ~= $100 quadrillion dollars, so that is unfortunately out of reach of most countries.
◧◩
11. Tao330+6f[view] [source] [discussion] 2023-12-27 17:30:36
>>dlandi+u7
Neural nets don't memorize passages of text. They train on vectorized tokens. You get a model of how language statistically works, not understanding and memory.
replies(2): >>FredPr+Ih >>tsimio+ur
◧◩◪
12. FredPr+Ih[view] [source] [discussion] 2023-12-27 17:46:25
>>Tao330+6f
You can encode understanding in a vector.

To use Andrew Ng's example, you build a multi-dimensional arrow representing "king". You compare it to the arrow for "queen" and you see that it's almost identical, except it points in the opposite direction in the gender dimension. Compare it to "man" and you see that "king" and "man" have some things in common, but "man" is a broader term.

That's getting really close to understanding as far as I'm concerned; especially if you have a large number of such arrows. It's statistical in a literal sense, but it's more like the computer used statistics to work out the meaning of each word by a process of elimination and now actually understands it.
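
A toy version of that arrow arithmetic, with made-up 3-dimensional vectors purely to illustrate the analogy (real embeddings have hundreds of dimensions learned from data):

    import numpy as np

    # Made-up 3-d "embeddings"; dimensions are roughly (royalty, gender, ordinariness).
    # These numbers are invented purely to illustrate the king/queen/man idea.
    vec = {
        "king":  np.array([0.9,  0.8, 0.7]),
        "queen": np.array([0.9, -0.8, 0.7]),
        "man":   np.array([0.1,  0.8, 0.9]),
        "woman": np.array([0.1, -0.8, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # The classic analogy: king - man + woman lands on (or near) queen.
    analogy = vec["king"] - vec["man"] + vec["woman"]
    print(cosine(analogy, vec["queen"]))   # ~1.0
    print(cosine(analogy, vec["man"]))     # much lower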

◧◩◪
13. tsimio+ur[view] [source] [discussion] 2023-12-27 18:38:07
>>Tao330+6f
The model weights clearly encode certain full passages of text, otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like "the most likely token after 'call' is 'me'; the most likely token after 'call me' is 'Ishmael'". It's ultimately a kind of lossy statistical compression scheme at some level.
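
A toy illustration of that framing, using raw word counts instead of a neural net (tiny made-up corpus, greedy decoding; real models are far less direct about it, but the memorization effect is the same):

    from collections import Counter, defaultdict

    # Toy "most likely next token" model built from word counts rather than a
    # neural net. With a tiny corpus and greedy decoding it reproduces its
    # training text verbatim, which is the memorization-as-statistics point.
    corpus = "call me ishmael . some years ago , never mind how long precisely".split()

    next_word = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        next_word[prev][cur] += 1

    def generate(start, length=10):
        out = [start]
        for _ in range(length):
            candidates = next_word.get(out[-1])
            if not candidates:
                break
            out.append(candidates.most_common(1)[0][0])   # greedy: most likely next word
        return " ".join(out)

    print(generate("call"))   # call me ishmael . some years ago , never mind how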
replies(1): >>photon+fu
◧◩◪◨
14. photon+fu[view] [source] [discussion] 2023-12-27 18:53:01
>>tsimio+ur
> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?

replies(2): >>seanmc+Mu >>fennec+7q3
◧◩◪◨⬒
15. seanmc+Mu[view] [source] [discussion] 2023-12-27 18:55:24
>>photon+fu
If you watch a bunch of movies and then go on to make your own movie based on influence from those movies, you are protected, even if you have mentally compressed them into your own movie. At one extreme, you learn from, are influenced by, and are inspired by copyrighted material (not copyright infringement); at the other, you are just making a poor copy of the material (definitely copyright infringement). LLMs are probably still closer to the latter case than the former, but eventually AI will reach the former case.
replies(2): >>photon+nF >>tremon+6I
◧◩◪
16. photon+Jx[view] [source] [discussion] 2023-12-27 19:12:16
>>solvei+u9
> Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.

It doesn't have to be perfect to be helpful, and even something that is very imperfect would at least send the signal that model-owners give a shit about attribution in general.

Given a specific output, it might be hard to say which sections of the very large weighted network were tickled during the output, and what inputs were used to build that section of the network. But this level of "citation resolution" is not always what people are necessarily interested in. If an LLM is giving medical advice, I might want to at least know whether it's reading medical journals or facebook posts. If it's political advice/summary/synthesis, it might be relevant to know how much it's been reading Marx vs Lenin or whatever. Pin-pointing original paragraphs as sources would be great, but for most models it's not like there's anything that's very clear about the input datasets.

EDIT: Building on this a bit, a lot of people are really worried about AI "poisoning the well" such that they are retraining on content generated by other AIs so that algorithmic feeds can trash the next-gen internet even worse than the current one. This shows that attribution-sourcing even at the basic level of "only human generated content is used in this model" can be useful and confidence-inspiring.

◧◩◪◨⬒⬓
17. photon+nF[view] [source] [discussion] 2023-12-27 19:55:12
>>seanmc+Mu
There's no obvious need to hold people / AI to the same standards here, yet, even if compression in mental models is exactly analogous to compression in machine models. I guess we already decided that corporations are "like" persons legally, but the jury is still out on AIs. Perhaps people should be allowed more leeway to make possibly-questionable derivative works, because they have lives to live, and genuine if misguided creative urges, and bills to pay, etc. Obviously it's quite difficult to try and answer the exact point at which synthesis & summary cross a line to become "original content". But it seems to me that, if anything, machines should be held to a higher standard than people.

Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.

replies(1): >>seanmc+IY
◧◩◪◨⬒⬓
18. tremon+6I[view] [source] [discussion] 2023-12-27 20:08:08
>>seanmc+Mu
But that's not what ChatGPT is doing, or is it? ChatGPT watches and records a bunch of movies, then stitches together its own movie using scenes and frames from the movies it recorded. AI will never reach the former case until it learns to operate a camera.
replies(1): >>seanmc+ZK
◧◩◪◨⬒⬓⬔
19. seanmc+ZK[view] [source] [discussion] 2023-12-27 20:20:47
>>tremon+6I
How do you know this isn’t what we are doing in some more advanced form? Anyways, the comparisons will become more apt as the tech advances.
◧◩◪◨⬒⬓⬔
20. seanmc+IY[view] [source] [discussion] 2023-12-27 21:39:00
>>photon+nF
> But it seems to me that, if anything, machines should be held to a higher standard than people.

If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book like a normal human could, without that material far more readily influencing new derived works (unlike say Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons).

It will be fun to trip through these questions over the next 20 years.

replies(1): >>Jensso+R91
◧◩◪◨⬒⬓⬔⧯
21. Jensso+R91[view] [source] [discussion] 2023-12-27 22:43:38
>>seanmc+IY
As long as machines need to leech on human creativity, those humans need to be paid somehow. The human ecosystem works fine thanks to the limitations of humans. A machine that can copy things with abandon, however, could easily disrupt this ecosystem, resulting in fewer new things being created in total; it just leeches without paying anything back, unlike humans.

If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

replies(1): >>seanmc+8k1
◧◩◪◨⬒⬓⬔⧯▣
22. seanmc+8k1[view] [source] [discussion] 2023-12-28 00:02:12
>>Jensso+R91
> If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

I disagree that our own creativity works any differently: nothing is very original; our current art is built on 100k years of building up from when cavemen would scrawl simple art onto stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps we generalize over several different sources, diluting plagiarism with abstraction?

We are still in the early days of this tech; we will be having very different conversations about it even as soon as 5 years from now.

◧◩◪
23. __loam+mw1[view] [source] [discussion] 2023-12-28 02:06:20
>>wrs+oa
So it's just bullshit then.
replies(1): >>fennec+Cp3
◧◩◪◨
24. fennec+Cp3[view] [source] [discussion] 2023-12-28 18:09:27
>>__loam+mw1
It's literally how our meat bag brains work pretty much.

Word association games are basically the same exercise, but with humans. And hell, I bet I could play a word association game with an LLM, too.

◧◩◪◨⬒
25. fennec+7q3[view] [source] [discussion] 2023-12-28 18:12:01
>>photon+fu
Is it still compression if I read Tolkien and reference similar or exact concepts when writing my own works?

Having a magical ring in my book after I've read Lord of the Rings, is that copyright infringement?

replies(1): >>tsimio+wW4
◧◩◪◨⬒⬓
26. tsimio+wW4[view] [source] [discussion] 2023-12-29 06:39:18
>>fennec+7q3
Generally, no; copyright deals with exact expression, not concepts. However, that can include the structure of a work, so if you wrote a book about little people who form a band together with humans and fairies and a mage to destroy a ring of power created by an ancient evil, where they start in their nice home but it gets attacked by the evil lord's knights [...] you may be breaking Tolkien's copyright.
[go to top]