zlacker

[parent] [thread] 93 comments
1. aantix+(OP)[view] [source] 2023-12-27 16:01:23
Why can't AI at least cite its source? This feels like a broader problem, nothing specific to the NYTimes.

Long term, if no one is given credit for their research, either the creators will start to wall off their content or not create at all. Both options would be sad.

A humane attribution comment from the AI could go a long way - "I think I read something about this <topic X> in the NYTimes <link> on January 3rd, 2021."

It appears that without attribution, long term, nothing moves forward.

AI loses access to the latest findings from humanity. And so does the public.

replies(10): >>make3+41 >>apante+51 >>solard+q3 >>8note+r5 >>awwaii+G6 >>Tulliu+f8 >>FredPr+6i >>whichf+aj >>throwu+0l >>devd00+2r
2. make3+41[view] [source] 2023-12-27 16:07:46
>>aantix+(OP)
"Why can't AI at least cite its source" each article seen alters the weights a tiny, non-human understandable amount. it doesn't have a source, unless you think of the whole humongous corpus that it is trained on
replies(3): >>aantix+k2 >>pxoe+V3 >>Foobar+35
3. apante+51[view] [source] 2023-12-27 16:07:49
>>aantix+(OP)
A neural net is not a database where the original source is sitting somewhere in an obvious place with a reference. A neural net is a black box of functions that have been automatically fit to the training data. There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.
replies(3): >>aantix+52 >>layer8+07 >>dlandi+z8
◧◩
4. aantix+52[view] [source] [discussion] 2023-12-27 16:14:31
>>apante+51
It's possible. Perplexity.ai is trying to solve this problem.

E.g. "Japan's App Store antitrust case"

https://www.perplexity.ai/search/Japans-App-Store-GJNTsIOVSy...

replies(2): >>Philpa+h4 >>simonw+r4
◧◩
5. aantix+k2[view] [source] [discussion] 2023-12-27 16:15:45
>>make3+41
We're trying to solve AGI but can't solve sources/citations?
6. solard+q3[view] [source] 2023-12-27 16:22:08
>>aantix+(OP)
There's a few levels to this...

Would it be more rigorous for AI to cite its sources? Sure, but the same could be said for humans too. Wikipedia editors, scholars, and scientists all still struggle with proper citations. NYT itself has been caught plagiarizing[1].

But that doesn't really solve the underlying issue here: That our copyright laws and monetization models predate the Internet and the ease of sharing/paywall bypass/piracy. The models that made sense when publishing was difficult and required capital-intensive presses don't necessarily make sense in the copy and paste world of today. Whether it's journalists or academics fighting over scraps just for first authorship (while some random web dev makes 3x more money on ad tracking), it's just not a long-term sustainable way to run an information economy.

I'd also argue that attribution isn't really that important to most people to begin with. Stuff, real and fake, gets shared on social media all the time with limited fact-checking (for better or worse). In general, people don't speak in a rigorous scholarly way. And people are often wrong, with faulty memories, or even incentivized falsehoods. Our primate brains aren't constantly in fact-checking mode and we respond better to emotional, plot-driven narratives than cold statistics. There are some intellectuals who really care deeply about attributions, but most humans won't.

Taking the above into consideration:

1) Useful AI does not necessarily require attribution

2) AI piracy is just a continuation of decades of digital piracy, and the solutions that didn't work in the 1990s and 2000s still won't work against AI

3) We need some better way to fund human creativity, especially as it gets more and more commoditized

4) This is going to happen with or without us. Cat's outta the bag.

I don't think using old IP law to hold us back is really going to solve anything in the long term. Yes, it'd be classy of OpenAI to pay everyone it sourced from, but long term that doesn't matter. Creativity has always been shared and copied and imitated and stolen, the only question is whether the creators get compensated (or even enriched) in the meantime. Sometimes yes, sometimes no, but it happens regardless. There'll always be noncommercial posts by the billions of people who don't care if AI, or a search engine, or Twitter, or whoever, profits off them.

If we get anywhere remotely close to AGI, a lot of this won't matter. Our entire economic and legal systems will have to be redone. Maybe we can finally get rid of the capitalist and lawyer classes. Or they'll probably just further enslave the rest of us with the help of their robo-bros, giving AI more rights than poor people.

But either way, this is way bigger than the economics of 19th-century newspapers...

[1] https://en.wikipedia.org/wiki/Jayson_Blair#Plagiarism_and_fa...

replies(1): >>aantix+a4
◧◩
7. pxoe+V3[view] [source] [discussion] 2023-12-27 16:24:44
>>make3+41
That just sounds like "we didn't even try to build those systems that way, and we're all out of ideas, so it basically will never work."

Which is a very, very common story with AI problems, be it sources, citations, licenses, usage tracking, etc.: it's all just "too complex if not impossible to solve," which at this point seems like a facade for intentionally ignoring those problems for profit. Those problems definitely exist, so why not try to solve them? Because actually trying to solve them would mean having to use data properly and pay creators, and that would cut into the bottom line. The point is free data use without having to pay, so why would they ruin that for themselves?

replies(2): >>simonw+O4 >>KHRZ+t7
◧◩
8. aantix+a4[view] [source] [discussion] 2023-12-27 16:25:46
>>solard+q3
Can you imagine spending decades of your life, studying skin cancer, only to have some $20/month ChatGPT index your latest findings and spit out generically to some subpar researcher:

"Here's how I would cure melanoma!" followed by your detailed findings. Zero mention of you.

F-that. Attribution, as best they can manage, is the least OpenAI can do as a service to humanity. It's a nod to all the content creators whose work they have built their business on.

Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.

replies(4): >>tansey+Gc >>pama+po >>Levitz+mS >>solard+KU
◧◩◪
9. Philpa+h4[view] [source] [discussion] 2023-12-27 16:26:23
>>aantix+52
That’s not the same thing. Perplexity is using an already-trained LLM to read those sources and synthesise a new result from them. This allows them to cite the sources used for generation.

LLM training sees these documents without context; it doesn’t know where they came from, and any such attribution would become part of the thing it’s trying to mimic.

It’s still largely an unsolved problem.

◧◩◪
10. simonw+r4[view] [source] [discussion] 2023-12-27 16:27:06
>>aantix+52
That's a different approach: they've implemented RAG, Retrieval Augmented Generation, where the tool runs additional searches as part of answering a question.

ChatGPT Browse and Bing and Google Bard implement the same pattern.

RAG does allow for some citation, but it doesn't help with the larger problem of not being able to cite for answers provided by the unassisted language model.
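
(For the curious, a minimal sketch of the RAG pattern; search_index, llm, and their methods are placeholders here, not any vendor's actual API:)

    def answer_with_citations(question, search_index, llm):
        # 1. Retrieve: run an ordinary search over an external document store.
        docs = search_index.search(question, top_k=5)  # hypothetical search call

        # 2. Augment: paste the retrieved text into the prompt, tagged by source.
        context = "\n\n".join(f"[{i+1}] {d.url}\n{d.text}" for i, d in enumerate(docs))
        prompt = ("Answer using ONLY the sources below and cite them as [n].\n\n"
                  f"{context}\n\nQuestion: {question}")

        # 3. Generate: the model can cite [1]..[5] because the sources were handed
        #    to it at query time, not because it knows where its training data
        #    came from.
        return llm.complete(prompt), [d.url for d in docs]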

◧◩◪
11. simonw+O4[view] [source] [discussion] 2023-12-27 16:28:28
>>pxoe+V3
What makes you think AI researchers (including the big labs like OpenAI and Anthropic) aren't trying to solve these problems?
replies(1): >>pxoe+Q7
◧◩
12. Foobar+35[view] [source] [discussion] 2023-12-27 16:30:28
>>make3+41
So why can my employer's implementation of Azure ChatGPT, running over our document systems, successfully cite its source documents?
replies(2): >>layer8+g6 >>Tao330+Sn
13. 8note+r5[view] [source] 2023-12-27 16:32:09
>>aantix+(OP)
If you're going to consider training AI as fair use, you'll have all kinds of people with different skill levels training AIs that work in different ways on the corpus.

Not all of them will have the capability to cite a source, and for plenty of them citing a source won't even make sense.

Eg. Suppose I train a regression that guesses how many words will be in a book.

Which book do I cite when I do an inference? All of them?
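
(To make the toy example concrete, a sketch with made-up numbers:)

    import numpy as np

    # Toy model: predict a book's word count from (pages, avg words per line),
    # "trained" on 10,000 fictional books.
    rng = np.random.default_rng(1)
    X = rng.uniform([100, 5], [1000, 15], size=(10_000, 2))
    y = X[:, 0] * X[:, 1] * 40 + rng.normal(0, 5_000, 10_000)

    coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

    # The whole "model" is now three numbers. Every book shaped them a little;
    # no individual book is retrievable from, or citable for, a prediction:
    print(np.dot([350, 11, 1], coef))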

replies(2): >>aantix+U6 >>benrow+xo
◧◩◪
14. layer8+g6[view] [source] [discussion] 2023-12-27 16:36:29
>>Foobar+35
Because the model proper wasn’t trained on those documents, it’s just RAG being employed with the documents as external sources. It’s a fundamentally different setup.
15. awwaii+G6[view] [source] 2023-12-27 16:38:39
>>aantix+(OP)
I think the gap between attributable knowledge and absorbed knowledge is pretty difficult to bridge. For news stuff, if I read the same general story from NYT and LA Times and WaPo then I'll start to get confused about which bit I got from which publication. In some ways, being able to verbatim quote long passages is a failure to generalize that should be fixed rather than reinforced.

Though the other way to do it is to clearly document the training data as a whole, even if you can't cite a specific entry for a particular bit of generated output. That would get useless quickly though, since you'd eventually have one big citation: "The Internet".

◧◩
16. aantix+U6[view] [source] [discussion] 2023-12-27 16:39:59
>>8note+r5
Any citation would be a good start.

For complex subjects, I'm sure the citation page would be large, and a count would be displayed demonstrating the depth of the subject[3].

This is how Google did it with search results in the early days[1]: most probable to least probable in terms of the relevancy of the page, with a count of all possible results[2].

The same attempt should be made for citations.

replies(1): >>jquery+cd
◧◩
17. layer8+07[view] [source] [discussion] 2023-12-27 16:40:28
>>apante+51
Presumably, if a passage of any significant length is reproduced verbatim (or almost verbatim), there should be a way to track that source through the weights.

The issue of replicating a style is probably more difficult.

replies(1): >>solvei+za
◧◩◪
18. KHRZ+t7[view] [source] [discussion] 2023-12-27 16:42:31
>>pxoe+V3
Just a question: do you remember a source for all the knowledge in your mind, or do you at least try to remember one?
replies(2): >>pxoe+y8 >>bluefi+mf
◧◩◪◨
19. pxoe+Q7[view] [source] [discussion] 2023-12-27 16:44:14
>>simonw+O4
The solutions haven't arrived. Neither have changes in lieu of having solutions. "Trying" isn't an actual, present, functional change, and it just gets passed around as an excuse for companies to keep doing whatever they're doing.
replies(1): >>pama+Tn
20. Tulliu+f8[view] [source] 2023-12-27 16:45:57
>>aantix+(OP)
> Why can't AI at least cite its source?

Because AI models aren't databases.

◧◩◪◨
21. pxoe+y8[view] [source] [discussion] 2023-12-27 16:47:13
>>KHRZ+t7
A computer isn't a human. Aren't computers good at storing data? Why can't they just store that data? The sources literally exist in the training datasets. Why can't they just reference those sources?

Human analogies are cute, but they're completely irrelevant. It doesn't change that this is specifically about computers, and it doesn't change or excuse how computers work.

replies(8): >>umvi+yb >>wrs+Qb >>qup+Yb >>jquery+Ec >>KHRZ+gf >>Tao330+7l >>Kim_Br+Gl >>Levitz+MQ
◧◩
22. dlandi+z8[view] [source] [discussion] 2023-12-27 16:47:17
>>apante+51
> There is no way to know what sources have been memorized vs which have made their mark by affecting other types of functions in the neural net.

But if it's possible for the neural net to memorize passages of text then surely it could also memorize where it got those passages of text from. Perhaps not with today's exact models and technology, but if it was a requirement then someone would figure out a way to do it.

replies(2): >>wrs+tb >>Tao330+bg
◧◩◪
23. solvei+za[view] [source] [discussion] 2023-12-27 16:58:38
>>layer8+07
> Presumably, if a passage of any significant length is reproduced verbatim (or almost verbatim), there should be a way to track that source through the weights.

Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.

replies(2): >>layer8+sc >>photon+Oy
◧◩◪
24. wrs+tb[view] [source] [discussion] 2023-12-27 17:04:01
>>dlandi+z8
Except it doesn’t memorize text. It generates text that is statistically likely. Generating a citation that is statistically likely wouldn’t really help the problem.
replies(1): >>__loam+rx1
◧◩◪◨⬒
25. umvi+yb[view] [source] [discussion] 2023-12-27 17:04:39
>>pxoe+y8
Yes, computers are good at storing data. But there's a big difference between information stored in a database and information stored in a neural network. The former is well defined, the latter is a giant list of numbers - literally a black box. So in this case, the analogy to a human brain is fairly on-point because just as you can't perfectly cite every source that comes out of your (black box) brain, other black boxes have similar challenges.
◧◩◪◨⬒
26. wrs+Qb[view] [source] [discussion] 2023-12-27 17:05:30
>>pxoe+y8
The analogy to a database is also irrelevant. LLMs aren’t databases.
◧◩◪◨⬒
27. qup+Yb[view] [source] [discussion] 2023-12-27 17:06:50
>>pxoe+y8
When all the legal precedents we have are about humans, human analogies are incredibly relevant.
replies(1): >>jazzyj+im
◧◩◪◨
28. layer8+sc[view] [source] [discussion] 2023-12-27 17:09:48
>>solvei+za
It’s likely first and foremost a resource problem. “How much different would the output be if that text hadn’t been part of the training data” can _in principle_ be answered by instead of training one model, training N models where N is the number of texts in the training data, omitting text i from the training data of model i, and then when using the model(s), run all N models in parallel and apply some distance metric on their outputs. In case of a verbatim quote, at least one of the models will stand out in that comparison, allowing to infer the source. The difficulty would be in finding a way to do something along those lines efficiently enough to be practical.
replies(1): >>spunke+Xe
◧◩◪◨⬒
29. jquery+Ec[view] [source] [discussion] 2023-12-27 17:10:57
>>pxoe+y8
LLMs are not databases. There is no "citation" associated with a specific query, any more than you can cite the source of the comment you just made.
replies(1): >>aantix+8i
◧◩◪
30. tansey+Gc[view] [source] [discussion] 2023-12-27 17:11:16
>>aantix+a4
Can you imagine spending decades of your life studying antibiotics, only to have an AI graph neural network beat you to the punch by conceiving an entirely new class of antibiotics (the first in 60 years) and getting published in Nature.

https://www.nature.com/articles/d41586-023-03668-1

replies(2): >>aantix+Kd >>015a+ZO1
◧◩◪
31. jquery+cd[view] [source] [discussion] 2023-12-27 17:14:23
>>aantix+U6
Ok, now please cite the source of this comment you just made. It's okay if the citation list is large, just list your citations from most probable to least probable.
replies(1): >>aantix+og
◧◩◪◨
32. aantix+Kd[view] [source] [discussion] 2023-12-27 17:18:05
>>tansey+Gc
It looks like the published paper managed to include plenty of citations.

https://dspace.mit.edu/handle/1721.1/153216

As it should be.

◧◩◪◨⬒
33. spunke+Xe[view] [source] [discussion] 2023-12-27 17:24:33
>>layer8+sc
Each LLM costs $10-100 million to train, times billions of training documents ~= $100 quadrillion dollars, so that is unfortunately out of reach of most countries.
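
(Rough arithmetic behind that estimate, assuming ~$10^8 per training run and ~10^9 training documents, i.e. one re-training per held-out document:)

    \$10^{8} \text{ per run} \times 10^{9} \text{ runs} \approx \$10^{17} \approx \$100 \text{ quadrillion}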
◧◩◪◨⬒
34. KHRZ+gf[view] [source] [discussion] 2023-12-27 17:26:03
>>pxoe+y8
OK, let's say you were given a source for an LLM output, such as "Common Crawl/reddit/1000000 books collection". Would that be useful? Probably not. Or do you want the chat system to run orders of magnitude slower so it can search petabytes of sources and constantly warn of similarities for every sentence? That's obviously a huge waste of resources. It should probably be done by users as appropriate for their use case, like these NY Times journalists, who were easily able to find such similarities themselves for their use case of "specifically crafted prompts to output NY Times text".
◧◩◪◨
35. bluefi+mf[view] [source] [discussion] 2023-12-27 17:26:22
>>KHRZ+t7
No, but I'm a human and treating computers like humans is a huge mistake that we shouldn't make.
replies(1): >>pama+Cn
◧◩◪
36. Tao330+bg[view] [source] [discussion] 2023-12-27 17:30:36
>>dlandi+z8
Neural nets don't memorize passages of text. They train on vectorized tokens. You get a model of how language statistically works, not understanding and memory.
replies(2): >>FredPr+Ni >>tsimio+zs
◧◩◪◨
37. aantix+og[view] [source] [discussion] 2023-12-27 17:31:42
>>jquery+cd
"Now displaying 3 citations out of ~150,000,000.."

[1] http://web.archive.org/web/20120608192927/http://www.google....

[2] https://steemit.com/online/@jaroli/how-google-search-result-...

[3] https://www.smashingmagazine.com/2009/09/search-results-desi...

[4] Next page

:)

replies(1): >>pama+um
38. FredPr+6i[view] [source] 2023-12-27 17:41:30
>>aantix+(OP)
A human can't credit the source of each element of everything they've learnt. AIs can't either, and for the same reason.

The knowledge gets distorted, blended, and reinterpreted a million ways by the time it's given as output.

And the metadata (metaknowledge?) would be larger than the knowledge itself. The AI learnt every single concept it knows by reading online; including the structure of grammar, rules of logic, the meaning of words, how they relate to one another. You simply couldn't cite it all.

replies(3): >>photon+mm >>anigbr+rA >>ahepp+iG
◧◩◪◨⬒⬓
39. aantix+8i[view] [source] [discussion] 2023-12-27 17:41:57
>>jquery+Ec
That's fine. Solve it a different way.

OpenAI doesn't just get to steal work and then say "sorry, not possible" and shrug it off.

The NYTimes should be suing.

replies(4): >>MeImCo+qm >>Kim_Br+Om >>Levitz+gT >>slyall+xX
◧◩◪◨
40. FredPr+Ni[view] [source] [discussion] 2023-12-27 17:46:25
>>Tao330+bg
You can encode understanding in a vector.

To use Andrew Ng's example, you build a multi-dimensional arrow representing "king". You compare it to the arrow for "queen" and you see that it's almost identical, except it points in the opposite direction in the gender dimension. Compare it to "man" and you see that "king" and "man" have some things in common, but "man" is a broader term.

That's getting really close to understanding as far as I'm concerned; especially if you have a large number of such arrows. It's statistical in a literal sense, but it's more like the computer used statistics to work out the meaning of each word by a process of elimination and now actually understands it.
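
(A toy version of that arrow arithmetic; the numbers are made up, and real embeddings have hundreds of dimensions:)

    import numpy as np

    # Made-up 3-d "embeddings": dimensions roughly (royalty, humanness, gender).
    vec = {
        "king":  np.array([0.9, 0.7, -1.0]),
        "queen": np.array([0.9, 0.7,  1.0]),
        "man":   np.array([0.1, 1.0, -1.0]),
        "woman": np.array([0.1, 1.0,  1.0]),
    }

    def closest(target):
        return min(vec, key=lambda w: np.linalg.norm(vec[w] - target))

    # The classic analogy: king - man + woman lands nearest to queen.
    print(closest(vec["king"] - vec["man"] + vec["woman"]))  # -> "queen"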

41. whichf+aj[view] [source] 2023-12-27 17:48:52
>>aantix+(OP)
Why do you expect an AI to cite its source? Humans are allowed to use and profit from knowledge they've learned from any and all sources without having to mention or even remember those sources.

Yes, we all agree that it's better if they do remember and mention their sources, but we don't sue them for failing to do so.

replies(1): >>firefl+0c1
42. throwu+0l[view] [source] 2023-12-27 17:57:09
>>aantix+(OP)
If you ask the AI to cite its sources, it will. It will hallucinate some of them, but in the last few months it's gotten really good at sending me to the right web page or Amazon book link for its sources.

Thing is though, if you look at the prompts they used to elicit the material, the prompt was already citing the NYTimes and its articles by name.

◧◩◪◨⬒
43. Tao330+7l[view] [source] [discussion] 2023-12-27 17:57:44
>>pxoe+y8
You'd effectively be asking it to cite sources on why the next token is statistically likely. Then it will hallucinate anyway and tell you the NYT said so. You might think you want this, but you don't.
◧◩◪◨⬒
44. Kim_Br+Gl[view] [source] [discussion] 2023-12-27 18:01:17
>>pxoe+y8
Can't have your cake and eat it too.

1. If you run different software (an LLM), install different hardware (GPU/TPU), and use it differently (natural language), to the point that in many ways it's a different kind of machine, does it actually surprise you that it works differently? There are definitely computer components in there somewhere, but they're combined in a somewhat different way, just like you can use the same Lego bricks to make either a house or a spaceship even though it's the same bricks. For one: GPT-4 is not quite going to display a Windows desktop for you (right this minute, at least).

2. Comparing to humans is fine. Else by similar logic a robot arm is not a human arm, and thus should not be capable of gripping things and picking them up. Obviously that logic has a flaw somewhere. A more useful logic might be to compare eg. Human arm, Gorilla arm, Robot arm, they're all arms!

◧◩◪◨⬒⬓
45. jazzyj+im[view] [source] [discussion] 2023-12-27 18:03:57
>>qup+Yb
There are a hundred years of legal precedent about technology upsetting the assumptions of copyright law. Humans use tools: radios, xerox machines, home video tape. AI is another tool that just makes making copies way easier. The law will be updated, hopefully without comparing an LLM to a man.
◧◩
46. photon+mm[view] [source] [discussion] 2023-12-27 18:04:17
>>FredPr+6i
> And the metadata (metaknowledge?) would be larger than the knowledge itself.

Because URLs are usually as long as the writing they point at?

replies(1): >>ahepp+7o
◧◩◪◨⬒⬓⬔
47. MeImCo+qm[view] [source] [discussion] 2023-12-27 18:04:29
>>aantix+8i
And, god willing, if there is any justice in the courts, the NYTimes will lose this frivolous lawsuit.

Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art and science for living memory. Unless copy/paste superhero movies are your definition of art I suppose.

Unfortunately it seems like judges and the general public are so clueless as to how this technology works it might get regulated into the ground by uneducated people before it ever has a chance to take off. All so we can protect endless listicle factories. What a shame.

replies(1): >>lewhoo+841
◧◩◪◨⬒
48. pama+um[view] [source] [discussion] 2023-12-27 18:04:44
>>aantix+og
This is not answering the GP question and does not count as a satisfactory ranked citation list. The first one is particularly dubious. Also you didn’t clarify which statement was based on which citation. I didn’t see “dog” in your text.

To help understand the complexity of an LLM, consider that these models typically hold about 10,000 times fewer parameters than the total number of characters in the training data. If one instructs the LLM to search the web and find relevant citations, it might obey, but those citations will not be the source of how it formed the opinions it expresses in its output.

replies(1): >>jquery+vca
◧◩◪◨⬒⬓⬔
49. Kim_Br+Om[view] [source] [discussion] 2023-12-27 18:07:01
>>aantix+8i
Clearly, "theft" is an analogy here (since we can't get it to fit exactly), but we can work with it.

You are correct, if I were to steal something, surely I can be made to give it back to you. However, if I haven't actually stolen it, there is nothing for me to return.

By analogy, if OpenAI copied data from the NYT, they should be able to at least provide a reference. But if they don't actually have a proper copy of it, they cannot.

◧◩◪◨⬒
50. pama+Cn[view] [source] [discussion] 2023-12-27 18:11:31
>>bluefi+mf
Treating computers like humans in this one particular way is very appropriate. It is the only way that LLM can synthesize a worldview when their training data is many thousands of times larger than their number of parameters. Imagine scaling up the total data by another factor of 1million in a few years. There is no current technology to store that info but we can easily train large neural nets that can recreate the essence of it, just like we traditionally trained humans to recall ideas.
◧◩◪
51. Tao330+Sn[view] [source] [discussion] 2023-12-27 18:13:00
>>Foobar+35
My understanding is that this lawsuit is about the training corpus. This is on the level of asking it to cite its sources for a/an/the.
◧◩◪◨⬒
52. pama+Tn[view] [source] [discussion] 2023-12-27 18:13:04
>>pxoe+Q7
Please recall how much the world changed in just the last year. What would be your expected timescale for the solution of this particular problem and why is it more important than instilling models with the ability to logically plan and answer correctly?
replies(1): >>pxoe+ks8
◧◩◪
53. ahepp+7o[view] [source] [discussion] 2023-12-27 18:14:14
>>photon+mm
I’m not an expert in AI training, but I don’t think it’s as simple as storing writing. It does seem to be possible to get the system to regurgitate training material verbatim in some cases, but my understanding is that the text is generated probabilistically.

It seems like a very difficult engineering challenge to provide attribution for content generated by LLMs, while preserving the traits that make them more useful than a “mere” search engine.

Which is to say nothing about whether that challenge is worth taking on.

replies(2): >>tsimio+sr >>photon+Ns
◧◩◪
54. pama+po[view] [source] [discussion] 2023-12-27 18:15:36
>>aantix+a4
If the future AI can indeed cure disease my mission of working in drug discovery will be complete. I’d much rather help cure people (my brother died of melanoma) than protect any patent rights or copyrighted text.
replies(1): >>aantix+yD
◧◩
55. benrow+xo[view] [source] [discussion] 2023-12-27 18:16:21
>>8note+r5
Regression is a good analogy of the problem here. If you found a line of best fit for some datapoints, how would you get back the original datapoints, from the line?

Now imagine terabytes worth of datapoints, and thousands of dimensions rather than two.
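
(Concretely, a sketch of why the fit can't be run backwards:)

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 1_000)
    y1 = 2 * x + 3 + rng.normal(0, 1, x.size)  # dataset A
    y2 = 2 * x + 3 + rng.normal(0, 1, x.size)  # dataset B: different points

    print(np.polyfit(x, y1, 1))  # ~ [2.0, 3.0]
    print(np.polyfit(x, y2, 1))  # ~ [2.0, 3.0], nearly identical coefficients

    # Two numbers summarize a thousand points; countless different datasets
    # produce the same line, so the original points can't be recovered from it.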

56. devd00+2r[view] [source] 2023-12-27 18:29:17
>>aantix+(OP)
Anyone in Open Source or with common sense would agree that this is the absolute minimum that the models should be doing. Good comment.
◧◩◪◨
57. tsimio+sr[view] [source] [discussion] 2023-12-27 18:32:00
>>ahepp+7o
Conceptually, it wouldn't be very hard to take the candidate output and run it through a text matching phase to see if there are ~exact matches in the training corpus, and generate other output if there are (probably limited to the parts of the training corpus where rights couldn't be obtained normally). Of course, it would be quite compute heavy, so it would add significantly to the cost per query.
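
(Roughly what such a matching phase could look like; a naive n-gram check, not how any vendor actually implements it:)

    def overlaps_corpus(candidate, corpus_ngrams, n=12):
        # Flag the output if any run of n consecutive words already appears
        # verbatim in a pre-built index of the training corpus.
        words = candidate.split()
        return any(" ".join(words[i:i + n]) in corpus_ngrams
                   for i in range(len(words) - n + 1))

    def generate_non_verbatim(llm, prompt, corpus_ngrams, max_tries=5):
        for _ in range(max_tries):
            out = llm.generate(prompt)  # hypothetical model call
            if not overlaps_corpus(out, corpus_ngrams):
                return out
        return "[could not produce a sufficiently original answer]"
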
replies(1): >>TheCor+6w
◧◩◪◨
58. tsimio+zs[view] [source] [discussion] 2023-12-27 18:38:07
>>Tao330+bg
The model weights clearly encode certain full passages of text, otherwise it would be virtually impossible for the network to produce verbatim copies of text. The format is something very vaguely like "the most likely token after "call" is "me"; the most likely token after "call me" is "Ishmael". It's ultimately a kind of lossy statistical compression scheme at some level.
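
(The degenerate, purely illustrative version of that idea is a literal next-token lookup table:)

    from collections import Counter, defaultdict

    corpus = "call me ishmael . some years ago - never mind how long precisely"
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1  # "after 'call', 'me' is most likely", and so on

    def most_likely_next(token):
        return counts[token].most_common(1)[0][0]

    print(most_likely_next("call"))  # -> "me"
    print(most_likely_next("me"))    # -> "ishmael"

    # A real LLM replaces this table with weights, but where the statistics are
    # sharp enough the effect is the same: the passage comes back verbatim.
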
replies(1): >>photon+kv
◧◩◪◨
59. photon+Ns[view] [source] [discussion] 2023-12-27 18:40:04
>>ahepp+7o
Sure, it's a hard problem, but as others have pointed out frequently in this thread.. there is not only "no incentive" to solve it but a clear disincentive. If one can say where the data comes from, one might have to prove that it was used only with permission. And the reason why it's a hard problem is not related to metadata volume being greater than content volume. Clearly a book title/year published is usually shorter than book contents.
◧◩◪◨⬒
60. photon+kv[view] [source] [discussion] 2023-12-27 18:53:01
>>tsimio+zs
> It's ultimately a kind of lossy statistical compression scheme at some level.

And on this subject, it seems worthwhile to note that compression has never freed anyone from copyright/piracy considerations before. If I record a movie with a cell phone at a worse quality, that doesn't change things. If a book is copied and stored in some gzipped format where I can only read a page at a time, or only read a random page at a time, I don't think that's suddenly fair-use.

Not saying these things are exactly the same as what LLMs do, but it's worth some thought, because how are we going to make consistent rules that apply in one case but not the other?

replies(2): >>seanmc+Rv >>fennec+cr3
◧◩◪◨⬒⬓
61. seanmc+Rv[view] [source] [discussion] 2023-12-27 18:55:24
>>photon+kv
If you watch a bunch of movies then go on to make your own movie based on influence from these movies, you are protected even if you have mentally compressed them into your own movie. At some point you can learn, be influenced, and be inspired by copyrighted material (not copyright infringement), and at some point you are just making a poor copy of the material (definitely copyright infringement). LLMs are probably still closer to the latter case than the former, but eventually AI will reach the former case.
replies(2): >>photon+sG >>tremon+bJ
◧◩◪◨⬒
62. TheCor+6w[view] [source] [discussion] 2023-12-27 18:57:04
>>tsimio+sr
GitHub Copilot supports that:

https://docs.github.com/en/copilot/configuring-github-copilo...

Given how cheap text search is compared with LLM inference, and that GitHub reuses the same infrastructure for its code search, I doubt it adds more than 1% to the total cost.

replies(1): >>edwint+Sd2
◧◩◪◨
63. photon+Oy[view] [source] [discussion] 2023-12-27 19:12:16
>>solvei+za
> Figure this out and you get to choose which AI lab you want to make seven figures at. It's a really difficult problem.

It doesn't have to be perfect to be helpful, and even something that is very imperfect would at least send the signal that model-owners give a shit about attribution in general.

Given a specific output, it might be hard to say which sections of the very large weighted network were tickled during the output, and what inputs were used to build that section of the network. But this level of "citation resolution" is not always what people are necessarily interested in. If an LLM is giving medical advice, I might want to at least know whether it's reading medical journals or facebook posts. If it's political advice/summary/synthesis, it might be relevant to know how much it's been reading Marx vs Lenin or whatever. Pin-pointing original paragraphs as sources would be great, but for most models it's not like there's anything that's very clear about the input datasets.

EDIT: Building on this a bit, a lot of people are really worried about AI "poisoning the well" such that they are retraining on content generated by other AIs so that algorithmic feeds can trash the next-gen internet even worse than the current one. This shows that attribution-sourcing even at the basic level of "only human generated content is used in this model" can be useful and confidence-inspiring.

◧◩
64. anigbr+rA[view] [source] [discussion] 2023-12-27 19:22:00
>>FredPr+6i
Of course not, but you can cite where specific facts or theories were first published. Now, I don't think that not doing so infringes any copyright interest or that doing so creates any liability, any more than if I cited to a scientific paper or public statement of opinion by someone else.
◧◩◪◨
65. aantix+yD[view] [source] [discussion] 2023-12-27 19:39:43
>>pama+po
The point is if you stop giving proper credit, people stop publicly publishing.

Would you keep publishing articles if five people immediately stole the content and put it up on their site, claiming ownership of your research? Doubtful.

replies(1): >>solard+4Z
◧◩
66. ahepp+iG[view] [source] [discussion] 2023-12-27 19:54:47
>>FredPr+6i
At the same time, there are situations where humans are expected to provide sources for their claims. If you talk about an event in the news, it would be normal for me to ask where you heard about it. 100% accuracy in providing a source wouldn’t be expected, but if you told me you had no idea, or told me something obviously nonsense, I would probably take what you said less seriously.
replies(1): >>fennec+eq3
◧◩◪◨⬒⬓⬔
67. photon+sG[view] [source] [discussion] 2023-12-27 19:55:12
>>seanmc+Rv
There's no obvious need to hold people / AI to same standards here, yet, even if compression in mental-models is exactly analogous to compression in machine-models. I guess we decided already that corporations are already "like" persons legally, but the jury is still out on AIs. Perhaps people should be allowed more leeway to make possibly-questionable derivative works, because they have lives to live, and genuine if misguided creative urges, and bills to pay, etc. Obviously it's quite difficult to try and answer the exact point at which synthesis & summary cross a line to become "original content". But it seems to me that, if anything, machines should be held to higher standard than people.

Even if LLMs can't cite their influences with current technology, that can't be a free pass to continue things this way. Of course all data brokers resist efforts along the lines of data-lineage for themselves and they want to require it from others. Besides copyright, it's common for datasets to have all kinds of other legal encumbrances like "after paying for this dataset, you can do anything you want with it, excepting JOINs with this other dataset". Lineage is expensive and difficult but not impossible. Statements like "we're not doing data-lineage and wish we didn't have to" are always more about business operations and desired profit margins than technical feasibility.

replies(1): >>seanmc+NZ
◧◩◪◨⬒⬓⬔
68. tremon+bJ[view] [source] [discussion] 2023-12-27 20:08:08
>>seanmc+Rv
But that's not what ChatGPT is doing, or is it? ChatGPT watches and records a bunch of movies, then stitches together its own movie using scenes and frames from the movies it recorded. AI will never reach the former case until it learns to operate a camera.
replies(1): >>seanmc+4M
◧◩◪◨⬒⬓⬔⧯
69. seanmc+4M[view] [source] [discussion] 2023-12-27 20:20:47
>>tremon+bJ
How do you know this isn't what we are doing in some more advanced form? Anyways, the comparisons will become more apt as the tech advances.
◧◩◪◨⬒
70. Levitz+MQ[view] [source] [discussion] 2023-12-27 20:44:08
>>pxoe+y8
I'm sorry if this is too callous, but if you don't understand what you are talking about you should first familiarize yourself with the problem, then make claims about what should be done.

It would be great if we could tell specifically how something like ChatGPT creates its output. It would be great for research, so it's not like there is no interest in it; it's just not an easy thing to do. It's more "Where did you get your identity from?" than "Who is the author of that book?". You might think "But sometimes what the machine gives CAN literally be the answer to 'Who is the author of that book?'", but even in those cases the answer is not restricted to that work alone; there is an entire background that makes it understand that this is what you want.

◧◩◪
71. Levitz+mS[view] [source] [discussion] 2023-12-27 20:52:49
>>aantix+a4
>Claiming knowledge without even acknowledging potential sources is gross. Solve it OpenAI.

I'm sorry, but pretty much nobody does this. There is no "And these books are how I learned to write like this" after each text. There is no "Thank you, Pythagoras!" after using the theorem. Generally you want sources, yes, but for verification and as a way to signal reliability.

Specifically academics and researchers do this, yes. Pretty much nobody else.

◧◩◪◨⬒⬓⬔
72. Levitz+gT[view] [source] [discussion] 2023-12-27 20:58:46
>>aantix+8i
Really? Solve it a different way? Do you realize the kind of tech we are talking about here?

This kind of mentality would have stopped the internet from existing. After all, it has been an absolute copyright nightmare, has it not?

If that's what copyright does then we are better without it.

◧◩◪
73. solard+KU[view] [source] [discussion] 2023-12-27 21:08:34
>>aantix+a4
If someone paid me to study cancer and I discovered a cure, I'd give it away with or without credit. Who cares?

If someone takes my software and uses it, cool. If they credit me, cool. If they don't, oh well. I'd still code.

Not everything needs to be ego driven. As long as the cancer researcher (and the future robots working alongside them) can make a living, I really don't think it matters whether they get credit outside their niches.

I have no idea who invented the CT scanner, X-ray machines, the hypodermic needle, etc. I don't really care. It doesn't really do me any good to associate Edison with light bulbs either, especially when LEDs are so much better now. I have no idea who designs the cars I drive. I go out of my way to avoid cults of personality like Tesla.

There's 8 billion of us. We all need to make a living. We don't need to be famous.

replies(2): >>aantix+NY >>bamboo+v71
◧◩◪◨⬒⬓⬔
74. slyall+xX[view] [source] [discussion] 2023-12-27 21:24:55
>>aantix+8i
You sound like one of those government people who demand encryption that has government backdoors but is perfectly safe from attackers.

When told it is impossible, they go "nerd harder, nerd!", as if demanding it will make it happen.

◧◩◪◨
75. aantix+NY[view] [source] [discussion] 2023-12-27 21:33:13
>>solard+KU
Your incentives are not everyone else's incentives.

If someone chooses to dedicate their life to a particular domain - they sacrifice through hard work, they make hard-earned breakthroughs, then they get to dictate how their work will be utilized.

Sure, you can give it away. Your choice. Be anonymous. Your choice.

But you don't get to decide for them.

And their work certainly doesn't deserve to be stolen by an inhumane, non-acknowledging machine.

◧◩◪◨⬒
76. solard+4Z[view] [source] [discussion] 2023-12-27 21:34:43
>>aantix+yD
Why do you think this? The entirety of Wikipedia is invisibly credited unless you go into the edit history. Most open source projects have pseudonymous contributors. People have written and will continue to write with or without credit.

Credit in academia is more the exception to the rule, and it's that cutthroat industry that needs a better, more cooperative system.

◧◩◪◨⬒⬓⬔⧯
77. seanmc+NZ[view] [source] [discussion] 2023-12-27 21:39:00
>>photon+sG
> But it seems to me that, if anything, machines should be held to higher standard than people.

If machines achieve sentience, does this still hold? Like, we have to license material for our sentient AI to learn from? They can't just watch a movie or read a book the way a normal human could, without that material more easily influencing new derived works (unlike, say, Eragon, which is shamelessly Star Wars/Harry Potter/LOTR with dragons)?

It will be fun to trip through these questions over the next 20 years.

replies(1): >>Jensso+Wa1
◧◩◪◨⬒⬓⬔⧯
78. lewhoo+841[view] [source] [discussion] 2023-12-27 22:03:04
>>MeImCo+qm
> Copyright law is a prehistoric and corrupt system that has been about protecting the profit margins of Disney and Warner Bros rather than protecting real art

These types of arguments miss the mark entirely imho. First and foremost, not every instance of copyrighted creation involves a giant corporation. Second, what you are arguing against is the unfair leverage corporations have when negotiating a deal with a rising artist.

◧◩◪◨
79. bamboo+v71[view] [source] [discussion] 2023-12-27 22:25:13
>>solard+KU
You sounds like you’re trying to be cool or karma farming ?

I have no idea who invented the CT scanner, Xray machines, the hyperdermic needle, etc. I don't really care.

Maybe you should care because those things didn’t fall out do the sky and someone sure as shit got paid to develop and build those things. You copy and pasted code is worth less, a CT scanner isn’t.

◧◩◪◨⬒⬓⬔⧯▣
80. Jensso+Wa1[view] [source] [discussion] 2023-12-27 22:43:38
>>seanmc+NZ
As long as machines need to leech on human creativity, those humans need to be paid somehow. The human ecosystem works fine thanks to the limitations of humans. A machine that could copy things with abandon, however, could easily disrupt this ecosystem, resulting in fewer new things being created in total; it just leeches without paying anything back, unlike humans.

If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

replies(1): >>seanmc+dl1
◧◩
81. firefl+0c1[view] [source] [discussion] 2023-12-27 22:50:56
>>whichf+aj
Quite simply, if you're stating things authoritatively, then you should have a source.
replies(1): >>jahews+eA1
◧◩◪◨⬒⬓⬔⧯▣▦
82. seanmc+dl1[view] [source] [discussion] 2023-12-28 00:02:12
>>Jensso+Wa1
> If we make a machine that is capable of being as creative as humans and train it to coexist in that ecosystem then it would be fine. But that is a very unlikely case, it is much easier to make a dumb bot that plagiarizes content than to make something as creative as a human.

I disagree; our own creativity does work that way: nothing is very original, our current art is built on 100k years of building up from when cavemen would scrawl simple art into stone (which they copied from nature). We are built for plagiarism, and only gross plagiarism is seen as immoral. Or perhaps we generalize over several different sources, diluting plagiarism with abstraction?

We are still in the early days of this tech, we will be having very different conversations about it even as soon as 5 years later.

◧◩◪◨
83. __loam+rx1[view] [source] [discussion] 2023-12-28 02:06:20
>>wrs+tb
So it's just bullshit then.
replies(1): >>fennec+Hq3
◧◩◪
84. jahews+eA1[view] [source] [discussion] 2023-12-28 02:34:23
>>firefl+0c1
Do you have a source for this claim?
◧◩◪◨
85. 015a+ZO1[view] [source] [discussion] 2023-12-28 05:06:32
>>tansey+Gc
As you already know, yet are being intentionally daft about: they didn't use an LLM trained on copyrighted material. There's a canyon of difference between leveraging AI as a tool and AI leveraging you as a tool.

LLMs have, to my knowledge, made zero significant novel scientific discoveries. Much like crypto, they're a failure of technology to meaningfully move humanity forward; their only accomplishment is to parrot and remix information they've been trained on, which does have some interesting applications that have made Microsoft billions of dollars over the past 12 months, but let's drop the whole "they're going to save humanity and must be protected at any cost" charade. They're not AGI, and because no one has even a mote of dust of a clue as to what it will take to make AGI, it's not remotely tenable to assert that they're even a stepping stone toward it.

◧◩◪◨⬒⬓
86. edwint+Sd2[view] [source] [discussion] 2023-12-28 09:50:46
>>TheCor+6w
It is questionable whether that filtering mechanism works, previous discussion: >>33226515

But even if it did an exact match search is not enough here. What if you take the source code and rename all variables and functions? The filter wouldn't trigger, but it'd still be copyright infringement (whether a human or a machine does that).

For such a filter to be effective it'd at least have to build a canonical representation of the program's AST and then check for similarities with existing programs. Doing that at scale would be challenging.
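
(A sketch of that canonicalization idea using Python's own ast module; real code-similarity detection is far more involved:)

    import ast

    class Canonicalize(ast.NodeTransformer):
        # Replace every function, argument, and variable name with a numbered
        # placeholder so that renaming alone can't defeat the comparison.
        def __init__(self):
            self.names = {}
        def _canon(self, name):
            return self.names.setdefault(name, f"v{len(self.names)}")
        def visit_FunctionDef(self, node):
            node.name = self._canon(node.name)
            self.generic_visit(node)
            return node
        def visit_arg(self, node):
            node.arg = self._canon(node.arg)
            return node
        def visit_Name(self, node):
            node.id = self._canon(node.id)
            return node

    def fingerprint(source):
        return ast.dump(Canonicalize().visit(ast.parse(source)))

    a = "def add(x, y):\n    return x + y"
    b = "def plus(first, second):\n    return first + second"
    print(fingerprint(a) == fingerprint(b))  # -> True: same program, renamed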

Wouldn't it be better to either:

* not include copyrighted content in the training material in the first place, or

* explicitly tag the training material with license and origin information, such that the final output can produce a proof of what training material was relevant for producing that output, and not mix differently licensed content?

◧◩◪
87. fennec+eq3[view] [source] [discussion] 2023-12-28 18:07:15
>>ahepp+iG
The raw technology behind it literally cannot do that.

The model is fuzzy; that's the learning part. It'll never follow the rules to the letter, the same way humans fuck up all the time.

But a model trained to be literate and to parse meaning could be provided with the hard data via a vector DB or similar. It can cite sources from there, or as it finds them via the internet, and tbf this is how they should've trained the model.

But in order to become literate, it needs to read... and we humans reuse phrases we've picked up all the time. "As easy as pie"; oops, copyright.

replies(1): >>ahepp+hs4
◧◩◪◨⬒
88. fennec+Hq3[view] [source] [discussion] 2023-12-28 18:09:27
>>__loam+rx1
It's literally how our meat bag brains work pretty much.

Word association games are basically the same exercise, but with humans. And hell, I bet I could play a word association game with an LLM, too.

◧◩◪◨⬒⬓
89. fennec+cr3[view] [source] [discussion] 2023-12-28 18:12:01
>>photon+kv
Is it still compression if I read Tolkien and reference similar or exact concepts when writing my own works?

Having a magical ring in my book after I've read The Lord of the Rings, is that copyright infringement?

replies(1): >>tsimio+BX4
◧◩◪◨
90. ahepp+hs4[view] [source] [discussion] 2023-12-29 00:18:13
>>fennec+eq3
I agree that the model being fuzzy is a key aspect of an LLM. It doesn't sound like we're just talking about re-using phrases though. "As easy as pie" is not under copyright. We're talking about the "knowledge" the model has obtained and in some cases spits out verbatim without attribution.

I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.

replies(1): >>fennec+ID4
◧◩◪◨⬒
91. fennec+ID4[view] [source] [discussion] 2023-12-29 02:10:14
>>ahepp+hs4
Sure, it definitely spits out facts, often without hallucinating. And it can reiterate titles and small chunks of copyrighted text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyrighted" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones, all the goddamned wikis, databases, etc. based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim to make it clear that it was trained on the original text firsthand, rather than secondhand from other sources.

◧◩◪◨⬒⬓⬔
92. tsimio+BX4[view] [source] [discussion] 2023-12-29 06:39:18
>>fennec+cr3
Generally, no; copyright deals with exact expression, not concepts. However, that can include the structure of a work, so if you wrote a book about little people who form a band together with humans and fairies and a mage to destroy a ring of power created by an ancient evil, where they start in their nice home but it gets attacked by the evil lord's knights [...] you may be infringing Tolkien's copyright.
◧◩◪◨⬒⬓
93. pxoe+ks8[view] [source] [discussion] 2023-12-30 15:02:50
>>pama+Tn
The timeline for LLMs and image generation has been 6+ years. It is not a thing that "arrived just this year and is only just changing"; it's been in development for a long time. And yet.
◧◩◪◨⬒⬓
94. jquery+vca[view] [source] [discussion] 2023-12-31 07:33:06
>>pama+um
You mean 10,000x fewer parameters? In other words, only 1 parameter for every 10,000 characters of input?

Yeah, good luck embedding citations into that. Everyone here saying it's easy needs to go earn their 7 figure comp at an AI company instead of wasting their time educating us dummies.

[go to top]