zlacker

[parent] [thread] 58 comments
1. agnost+(OP)[view] [source] 2023-11-20 01:27:50
Many are speculating that Sam Altman could just move on and create another OpenAI 2.0 because he could easily attract talent and investors.

What this misses is all the regulatory capture that he’s been campaigning for. All the platforms have now closed their gardens. Authors and artists are much more vigilant about copyright etc. So it’s now a totally different game compared to 3 years ago because the data is not just there up for grabs anymore.

replies(12): >>sensan+91 >>treesc+g1 >>shoele+v2 >>Zambyt+D2 >>Shekel+P2 >>facu17+X5 >>shmatt+X7 >>aj0str+fh >>jacque+oj >>yalogi+Ym >>qingch+jn >>pauldd+Nz
2. sensan+91[view] [source] 2023-11-20 01:37:01
>>agnost+(OP)
I've never considered this angle, but god it'd be hilarious if this ended up being the case, the dude ruining everything because of his own greed ultimately fucking himself over because of it.

Here's to hoping there's still some poetic irony left to dish out in the world.

replies(1): >>Loughl+62
3. treesc+g1[view] [source] 2023-11-20 01:37:43
>>agnost+(OP)
"Why can't Sam Altman (who can have the funding, talent, and vision OpenAI has right now) just create OpenAI 2.0?" is an amazing question that also answers what OpenAI's moat is.

People speculated it was the funding, or attracting talent, or having "access". Turns out it was none of them (obviously they all play a part, but having all three doesn't mean you can best OpenAI, which points to the fundamental reason why it is so hard to compete with them).

◧◩
4. Loughl+62[view] [source] [discussion] 2023-11-20 01:43:04
>>sensan+91
Seriously, this will be the exact reason the word schadenfreude exists.
5. shoele+v2[view] [source] 2023-11-20 01:45:31
>>agnost+(OP)
Is the logic here that training a base model isn’t as easy or even possible in the same way that OpenAI did in the past, and that what they have in a trained model is valuable in that even with all the code and experience it couldn’t be reproduced today with new restrictions?
replies(1): >>codere+gb
6. Zambyt+D2[view] [source] 2023-11-20 01:46:16
>>agnost+(OP)
He should make ClosedAI and then publish all the work as open source
replies(2): >>pjot+Pg >>plorg+bB
7. Shekel+P2[view] [source] 2023-11-20 01:47:30
>>agnost+(OP)
I don't think getting training data is that hard, still. The biggest platforms that locked down their APIs still use them for their mobile apps, and those apps can easily be reverse engineered to find keys or undocumented endpoints (or, in the case of reddit, an entirely different internal API with fewer limits and a lot more info leaks...)
replies(4): >>bloqs+c3 >>thunks+Y3 >>monoca+fa >>mongol+Ol
◧◩
8. bloqs+c3[view] [source] [discussion] 2023-11-20 01:51:39
>>Shekel+P2
Can you explain the reddit one?
replies(2): >>4death+W3 >>Shekel+ud
◧◩◪
9. 4death+W3[view] [source] [discussion] 2023-11-20 01:55:57
>>bloqs+c3
Assuming the Reddit app does not use certificate pinning, you can use your computer to provide internet to your phone and then use an app like Charles Proxy to inspect requests being made from an app. Pretty easy to reverse engineer the API.

If the app does use certificate pinning, then you can use an Android phone and a modified app that removes the logic that enforces certificate pinning. This is more involved but also not impossible.

replies(3): >>philis+o4 >>patcon+5h >>gumbal+Se1
◧◩
10. thunks+Y3[view] [source] [discussion] 2023-11-20 01:56:32
>>Shekel+P2
I think it's a lot harder while you still have lots of lawsuits coming in against AI models.
◧◩◪◨
11. philis+o4[view] [source] [discussion] 2023-11-20 01:59:20
>>4death+W3
That does not sound like the proper way to do an OpenAI 2.0. If Reddit ever hears that's how an AI company scraped them, that company will get sued for fun and profit.
replies(4): >>az226+b5 >>4death+16 >>wahnfr+r6 >>Shekel+Qd
◧◩◪◨⬒
12. az226+b5[view] [source] [discussion] 2023-11-20 02:06:52
>>philis+o4
You can legally scrape anything that does not require a login in the US. You can also legally train an AI on it for now.
replies(1): >>ejstro+n8
13. facu17+X5[view] [source] 2023-11-20 02:12:52
>>agnost+(OP)
You are assuming he wouldn't steal it from OpenAI. He could have a low-level employee steal it and manage to keep it a secret until AGI is born, then he takes over the world.
replies(1): >>jacque+Ej
◧◩◪◨⬒
14. 4death+16[view] [source] [discussion] 2023-11-20 02:13:14
>>philis+o4
The point is that the data is easily accessible. If you wanted to get your hands on the data while simultaneously keeping them clean, contract with a Russian contracting company to give you a data dump. You don't need to know how they got it.
replies(2): >>twoodf+37 >>mr_toa+Gc
◧◩◪◨⬒
15. wahnfr+r6[view] [source] [discussion] 2023-11-20 02:15:42
>>philis+o4
you're aware openai trained on a boatload of pirated ebooks?

they "steal" access to data because the LLM launders it on the other end

replies(2): >>philis+68 >>bko+Db
◧◩◪◨⬒⬓
16. twoodf+37[view] [source] [discussion] 2023-11-20 02:20:46
>>4death+16
Well, until discovery, wherein your deliberate not knowing will be a pretty big deal.
17. shmatt+X7[view] [source] 2023-11-20 02:28:34
>>agnost+(OP)
Exactly. If new Altman AI company is the next big thing, then Astral AI is the next trillion dollar company

Easier said than done

◧◩◪◨⬒⬓
18. philis+68[view] [source] [discussion] 2023-11-20 02:29:12
>>wahnfr+r6
That is frustrating to no end. If I pirate one book I should pay a hefty fine. If a company does it it's unlocking untapped value.
◧◩◪◨⬒⬓
19. ejstro+n8[view] [source] [discussion] 2023-11-20 02:30:24
>>az226+b5
Are you referring to the LinkedIn case? There has not been a decision on the legality of scraping in that matter
◧◩
20. monoca+fa[view] [source] [discussion] 2023-11-20 02:44:46
>>Shekel+P2
Easier than that would just be downloading the torrent of all of Reddit through Sept 2023.

https://academictorrents.com/details/89d24ff9d5fbc1efcdaf9d7...

replies(2): >>q7xvh9+Ya >>PaulDa+xk
◧◩◪
21. q7xvh9+Ya[view] [source] [discussion] 2023-11-20 02:49:59
>>monoca+fa
That's fascinating that the total size is so tiny — only 2.4 TB‽

I assume this must be only the text portion, and heavily compressed?

replies(1): >>lxgr+Vf
◧◩
22. codere+gb[view] [source] [discussion] 2023-11-20 02:51:35
>>shoele+v2
Yes, potentially.

The data has to come from somewhere, and all of the outlets that were used to train ChatGPT, stable diffusion, etc. have since been locked down. Any new company that Sam Altman makes in the AI space won't be competing just on merits of talent and product, they will also need to pay for and negotiate access to data.

I'd actually expect this to get far worse going forward, now that other organizations have an idea of how valuable their data is. It's also trivial to justify locking it down under the guise of protecting people, privacy, etc.

replies(1): >>rblatz+Dk
◧◩◪◨⬒⬓
23. bko+Db[view] [source] [discussion] 2023-11-20 02:54:57
>>wahnfr+r6
What do you base this on?

LLMs know the contents of books because they are analyzed, reviewed and spoken about everywhere. Pick some obscure book that doesn't show up on any social media and ask about its contents. GPT won't have a clue.

replies(1): >>wahnfr+Yd
◧◩◪◨⬒⬓
24. mr_toa+Gc[view] [source] [discussion] 2023-11-20 03:01:19
>>4death+16
Subcontracting out your crimes isn’t going to fly in court.
replies(2): >>4death+fl >>flir+Al
◧◩◪
25. Shekel+ud[view] [source] [discussion] 2023-11-20 03:07:57
>>bloqs+c3
The reddit app uses an undocumented GraphQL-based API, separate from the publicly available REST API used by third-party apps.
◧◩◪◨⬒
26. Shekel+Qd[view] [source] [discussion] 2023-11-20 03:10:31
>>philis+o4
It's essentially impossible to prove in court that training data was obtained or used improperly unless you go and tell on yourself. And even then it requires you to actually make someone with a lot of money mad, or to not have enough money yourself. Otherwise Microsoft would certainly have caught lots of flak already for training their models on every GitHub repo. Instead they got a minor paddling in the public eye that went away after not much time had passed.
replies(1): >>mongol+7m
◧◩◪◨⬒⬓⬔
27. wahnfr+Yd[view] [source] [discussion] 2023-11-20 03:12:47
>>bko+Db
https://qz.com/openai-books-piracy-microsoft-meta-google-cha....

What's your evidence contrary to this? Sounds like your common sense rather than inside knowledge

replies(1): >>bko+7B2
◧◩◪◨
28. lxgr+Vf[view] [source] [discussion] 2023-11-20 03:30:37
>>q7xvh9+Ya
Text really doesn't take up that much space, and in addition it compresses pretty well.

The entire English language Wikipedia is only around 60GB in a format that can be readily searched and randomly accessed (ZIM), for example: https://kiwix.org/
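As a rough illustration of the point that text compresses well, here is a minimal Python sketch using the standard-library zlib module. The function name compression_ratio and the sample string are made up for illustration. This is not how the Reddit torrent or the Kiwix ZIM files are actually packed, and the artificially repetitive sample will compress far better than real prose would.

```python
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Return original_size / compressed_size for UTF-8 encoded text."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)


# Hypothetical sample: one sentence repeated many times, not real forum data.
sample = (
    "Text really does not take up that much space, "
    "and in addition it compresses pretty well. "
) * 200

if __name__ == "__main__":
    raw_size = len(sample.encode("utf-8"))
    print(f"raw bytes: {raw_size}, ratio: {compression_ratio(sample):.1f}x")
```

Running the same function over a real text dump would give a more honest number than this toy input.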

replies(1): >>lmm+Rg
◧◩
29. pjot+Pg[view] [source] [discussion] 2023-11-20 03:38:51
>>Zambyt+D2
Yeah that will show ‘em!
◧◩◪◨⬒
30. lmm+Rg[view] [source] [discussion] 2023-11-20 03:39:12
>>lxgr+Vf
Does Kiwix actually work? I see people hyping it here but I could never get it to actually, y'know, download the file and display the wikipedia on my phone.
replies(3): >>lxgr+8h >>pc2slo+7i >>vatuei+ki
◧◩◪◨
31. patcon+5h[view] [source] [discussion] 2023-11-20 03:41:44
>>4death+W3
Yeah! <3 https://github.com/mitmproxy/android-unpinner
◧◩◪◨⬒⬓
32. lxgr+8h[view] [source] [discussion] 2023-11-20 03:42:07
>>lmm+Rg
It works perfectly for me, both on iOS and macOS.
33. aj0str+fh[view] [source] 2023-11-20 03:44:08
>>agnost+(OP)
I would bet the senior researchers know exactly how and where to get plenty of tokens.
◧◩◪◨⬒⬓
34. pc2slo+7i[view] [source] [discussion] 2023-11-20 03:55:32
>>lmm+Rg
Just downloaded. Doesn't seem to want to download Wikipedia on my phone, it says "detecting if filesystem supports 4gb files"
◧◩◪◨⬒⬓
35. vatuei+ki[view] [source] [discussion] 2023-11-20 03:58:28
>>lmm+Rg
Kiwix worked for me. IIRC there may be difficulties opening an archive that was downloaded outside of the mobile app, but archives downloaded in-app were fine.

For the mobile app I used one of the smaller Wikipedia subsets, since I didn't want to take up too much space on my phone. The full offline Wikipedia download is saved to my laptop.

36. jacque+oj[view] [source] 2023-11-20 04:10:20
>>agnost+(OP)
What has been crawled stays crawled and there are plenty of copies of sets of tokens that can be used to retrain a model. For a bit of money you can probably get any set that you really want (bit: billions, but pocket change for anything that is going to go head to head with OpenAI).
◧◩
37. jacque+Ej[view] [source] [discussion] 2023-11-20 04:13:31
>>facu17+X5
This is a pretty wild comment. That's a very safe assumption, and no low-level employee will do Sam's bidding in an illegal enterprise. Keeping it a secret isn't going to work either, and whether or not AGI is 'born' (who will bear it?) is an open question to which I hope the answer is 'not for a while', because we haven't even figured out how to get humans to cooperate, which I think should be a prerequisite.
replies(1): >>mcpack+cu
◧◩◪
38. PaulDa+xk[view] [source] [discussion] 2023-11-20 04:21:54
>>monoca+fa
The question is: if you then added all of Usenet before, say, 1992, would the effective intelligence of the trained LLM go up or down?
replies(1): >>yjftsj+pl
◧◩◪
39. rblatz+Dk[view] [source] [discussion] 2023-11-20 04:22:53
>>codere+gb
Couldn’t he just partner with Microsoft and start with the Bing corpus?
replies(1): >>piuant+vl
◧◩◪◨⬒⬓⬔
40. 4death+fl[view] [source] [discussion] 2023-11-20 04:30:30
>>mr_toa+Gc
Really? It's done pretty regularly to limit liability.
replies(1): >>Nasrud+ND
◧◩◪◨
41. yjftsj+pl[view] [source] [discussion] 2023-11-20 04:31:58
>>PaulDa+xk
I can't speak to intelligence, but the result would be ignorant in a meaningful way.
◧◩◪◨
42. piuant+vl[view] [source] [discussion] 2023-11-20 04:32:26
>>rblatz+Dk
What if Microsoft, et al, signed exclusivity with OpenAI?
replies(1): >>Comman+S23
◧◩◪◨⬒⬓⬔
43. flir+Al[view] [source] [discussion] 2023-11-20 04:33:23
>>mr_toa+Gc
If it's done in a country where it's legal, maybe even processed in the same country and all you take out is the weights, I bet it gets a bit muddier.
◧◩
44. mongol+Ol[view] [source] [discussion] 2023-11-20 04:36:07
>>Shekel+P2
That would still pose a legal problem.
◧◩◪◨⬒⬓
45. mongol+7m[view] [source] [discussion] 2023-11-20 04:39:08
>>Shekel+Qd
It is not impossible. You can call witnesses, refer to emails, source code etc.
46. yalogi+Ym[view] [source] 2023-11-20 04:48:30
>>agnost+(OP)
OpenAI has enough momentum and has built enough of a moat that Sam Altman cannot replicate it. If he can actually replicate it and overtake OpenAI, then the business itself has no legs, as it will be easily commoditized and any moat nullified in no time.
replies(1): >>pauldd+Yz
47. qingch+jn[view] [source] 2023-11-20 04:51:29
>>agnost+(OP)
There are still huge vaults of untapped data.

I'm building a magazine encyclopedia and I would estimate that 99.9% of all magazines ever published are not available electronically. And that the content in magazines probably exceeds the content in books by an order of magnitude.

replies(1): >>anacro+FC
◧◩◪
48. mcpack+cu[view] [source] [discussion] 2023-11-20 05:34:47
>>jacque+Ej
> no low level employee will do Sams bidding in an illegal enterprise

Many people have betrayed their country to foreign governments in exchange for mere thousands of dollars. It is never safe to rule out the willingness of employees to engage in corporate espionage, even in exchange for truly pitiful rewards. It would be a stupid idea, but that doesn't mean it won't happen.

49. pauldd+Nz[view] [source] 2023-11-20 06:08:37
>>agnost+(OP)
There are massive numbers of archives.
◧◩
50. pauldd+Yz[view] [source] [discussion] 2023-11-20 06:09:27
>>yalogi+Ym
...unless he takes half the key employees with him.
◧◩
51. plorg+bB[view] [source] [discussion] 2023-11-20 06:15:23
>>Zambyt+D2
Do BoringAI or LibreAI and it's just a fork but you ripped out all the old, bad stuff. (This joke doesn't really work because OpenAI is not really old enough for legacy cruft and isn't actually open enough to just be forked)
◧◩
52. anacro+FC[view] [source] [discussion] 2023-11-20 06:25:12
>>qingch+jn
Orders of magnitude require at least 3 data points, and you have only 2.
replies(1): >>Vingdo+nH
◧◩◪◨⬒⬓⬔⧯
53. Nasrud+ND[view] [source] [discussion] 2023-11-20 06:33:06
>>4death+fl
They make a point of not directly asking for the crime when they do that. They just increase pressure on subcontractors, which leads to cutting corners, including legal ones.

It is harder to prove to a "should have known" standard compared to say buying stolen speakers from the back of a truck for 20% of the list price.

replies(1): >>4death+PC2
◧◩◪
54. Vingdo+nH[view] [source] [discussion] 2023-11-20 06:57:03
>>anacro+FC
I know this is getting off-topic, but as a non-native speaker, I'm interested in hearing how a third data point would be needed to judge whether things differ "by an order of magnitude". I was under the impression that "an order of magnitude" meant "one more digit", meaning very roughly a 10x difference. "a >= 10*b" can be determined without the need of a third data point. Is there some other meaning to the phrase I haven't come across?
replies(1): >>pilotn+6N1
◧◩◪◨
55. gumbal+Se1[view] [source] [discussion] 2023-11-20 10:01:37
>>4death+W3
Why y’all desperate to steal data to train non intelligent software? Reddit and others should sue for license violations.
◧◩◪◨
56. pilotn+6N1[view] [source] [discussion] 2023-11-20 13:39:21
>>Vingdo+nH
Not the original poster, but you have it more or less correct. An order of magnitude is 10X. Orders of magnitude just refers to “at least 100X.” Colloquially, orders of magnitude just means “significantly more/less.”
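The two-quantity comparison described above can be sketched in a few lines of Python. The function names and the magazine and book counts are hypothetical, chosen only to illustrate the arithmetic. The point is that log10 of the ratio needs just two numbers, so no third data point is required.

```python
import math


def orders_of_magnitude(a: float, b: float) -> float:
    """Return log10(a / b): how many powers of ten separate a from b."""
    if a <= 0 or b <= 0:
        raise ValueError("both quantities must be positive")
    return math.log10(a / b)


def differs_by_an_order(a: float, b: float) -> bool:
    """True when a is at least 10x b, i.e. the 'a >= 10*b' test."""
    return a >= 10 * b


if __name__ == "__main__":
    # Hypothetical counts, not real publishing statistics.
    magazines, books = 5_000_000, 500_000
    print(orders_of_magnitude(magazines, books))
    print(differs_by_an_order(magazines, books))
```

A result of 2 or more from orders_of_magnitude corresponds to the "at least 100X" reading of the plural.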
◧◩◪◨⬒⬓⬔⧯
57. bko+7B2[view] [source] [discussion] 2023-11-20 17:21:46
>>wahnfr+Yd
Did you read the article? (This one misstates the case, but look at the linked one about the lawsuit.) This is a lawsuit. Nothing has been proven. The burden of proof is on you.
◧◩◪◨⬒⬓⬔⧯▣
58. 4death+PC2[view] [source] [discussion] 2023-11-20 17:26:47
>>Nasrud+ND
There’s an implicit assumption in your argument that you’re going to directly ask for a crime to be committed. Why are you assuming that? You’ll go to a contractor and say “we want Reddit data.” Anyone with even mild technical competence can figure out how to get it.
◧◩◪◨⬒
59. Comman+S23[view] [source] [discussion] 2023-11-20 18:57:21
>>piuant+vl
By the sounds of it, they just DID.