zlacker

[parent] [thread] 17 comments
1. brucet+(OP)[view] [source] 2023-07-01 20:08:08
Also, taking Elon's word at face value for a second... is Twitter really worth scraping for AI training or whatever?

It's a hive of misinformation, disinformation and toxicity. It's succinct I guess, but nothing is eloquent or descriptive because of the character limit. And it's full of repetitive "filler" information.

Who wants that in a foundational LLM dataset?

Maybe it's OK for finding labeled images... But that still seems kinda iffy.

replies(8): >>kitsun+u >>afterb+K >>Hoasi+u4 >>TillE+D7 >>ben_w+oc >>epista+0e >>exo-pl+0g >>muixoo+5l
2. kitsun+u[view] [source] 2023-07-01 20:10:34
>>brucet+(OP)
The effectiveness of this sort of lockdown is questionable anyway, because the cat's already out of the bag and there's no getting it back in. Same for Reddit. The bulk of the data's already out there and nothing these companies can do will change that.
3. afterb+K[view] [source] 2023-07-01 20:11:51
>>brucet+(OP)
Maybe someone is trying to make a disinformation bot. (half-serious)

I mean as far as uses for LLMs go that seems to me a pretty realistic one. Mass quick propaganda with little effort. Go for immediate impact, doesn't matter if people look deeper, you're just looking to get a swell of emotional reactions.

replies(1): >>brucet+M1
◧◩
4. brucet+M1[view] [source] [discussion] 2023-07-01 20:18:01
>>afterb+K
Yeah, I guess it's a way to make an "engagement optimization" bot using follows/likes from posts as criteria.

... That is horrifying.

5. Hoasi+u4[view] [source] 2023-07-01 20:30:57
>>brucet+(OP)
> Also, taking Elon's word at face value for a second... is Twitter really worth scraping for AI training or whatever?

Maybe... if you build an LLM scraping for the lulz?

6. TillE+D7[view] [source] 2023-07-01 20:48:20
>>brucet+(OP)
It's useful if you want your LLM to be able to generate tweet-like microblogging text. That does have some value.

Or maybe you want to get an aggregate idea of what people are currently talking about in the world, stuff that doesn't rise to the level of capital-n News. There aren't a lot of alternatives for that.

replies(1): >>brucet+mc
◧◩
7. brucet+mc[view] [source] [discussion] 2023-07-01 21:15:59
>>TillE+D7
Output formatting or a quick finetune/LoRA can do microblogging very easily.

Yeah, lots of general chat is unfortunately stuck in Twitter (or difficult-to-scrape, siloed-off platforms).

8. ben_w+oc[view] [source] 2023-07-01 21:16:01
>>brucet+(OP)
The thing that LLMs bring to the table isn't factual knowledge — we already had that, even some AI projects specifically dedicated to that — but rather the ability to correctly interact with natural language.

Twitter is great for examples of that, and the toxicity and disinformation don't get in the way.

Conversely, a training set doesn't need to be up to date to be useful for that.

I don't know if anyone really was trying to scrape it (examples of Musk disagreeing with his own engineers come to mind), but I assume it's possible, and given the quality of code ChatGPT spits out I can easily believe a really bad scraper has been produced by someone who thought they could do without hiring a software developer. If so, they might think they can get hot stock tips or forewarning of a pandemic from which emoji people post or something — not really what an LLM is for, but loads of people (even here!) conflate all the different kinds of AI into one thing.

9. epista+0e[view] [source] 2023-07-01 21:27:43
>>brucet+(OP)
While there may be huge sections of Twitter content that are like what you describe, I haven't encountered that. Instead I see tons of hyper-focused discussion from very specialized scientists that I wouldn't see otherwise. I see lots of discussion of obscure housing policy that I wouldn't see otherwise.

Now, this has been severely degraded by the changes that Musk has made. The spam in direct messages is off the charts now, whereas in the past I would get maybe a spam per year. And when one of my areas of interest has a post that gets popular, I have to scroll past all the insipid clout-chasing replies from blue check marks which get floated to the top of replies in an attempt to reward some of the worst people on the internet. Also, the long form tweets that need to be expanded are a big deflation of user experience, as reading and replying to those is suboptimal compared to a tweet thread.

But this is also the general internet: 99% spam plus 1% quality. And the quality of the 1% of good Twitter is some of the very best material out there.

And since LLMs have been trained on this same mix... they seem to be mostly good at filtering. But they do lie an awful lot.

replies(3): >>rvba+Oh >>michae+7k >>brucet+OD
10. exo-pl+0g[view] [source] 2023-07-01 21:39:40
>>brucet+(OP)
Don't write Elon off. If your goal is to create a toxic misinformation bot, Twitter is indispensable.
◧◩
11. rvba+Oh[view] [source] [discussion] 2023-07-01 21:50:10
>>epista+0e
As someone who doesn't use Twitter, I don't understand how you can have any sort of a real discussion with a 140 character limit.

The best discussion platform is IMHO the older version of reddit / i.reddit with the nested comments + possibility to be indexed by google + possibility to reply to old posts. The super-nesting comments feature is great.

replies(1): >>epista+Gi
◧◩◪
12. epista+Gi[view] [source] [discussion] 2023-07-01 21:57:26
>>rvba+Oh
It's a 280 character per message limit, with replies.

This is actually hugely beneficial to discussion as it makes people focus on the most salient point first, and further points go below, and each is easy to address individually.

Longer form material goes to outside links, sometimes, but Twitter threads are also great for long form content. At least for executive summaries that link out to the detailed bits for each primary point. Once the UI for Twitter prioritized threading, it became quite easy to express extremely long chains of evidence.

replies(1): >>mkl+do
◧◩
13. michae+7k[view] [source] [discussion] 2023-07-01 22:06:37
>>epista+0e
Can you share some profiles/contents like this? I've been searching for it and failing miserably.
replies(1): >>epista+gm
14. muixoo+5l[view] [source] 2023-07-01 22:12:45
>>brucet+(OP)
I once got paid $20 as an undergrad to go through hundreds of thousands of tweets and convert slang into plain English for training data. The only thing I took away from the experience, aside from finally getting good with vim macros, is the average tweet is really low-effort and uninteresting. I don't recall reading a single thing that I would imagine someone retweeting (think that's what it's called). Maybe I was given only replies. Anyway, not sure if there's value there for LLMs, but I'd be skeptical.
◧◩◪
15. epista+gm[view] [source] [discussion] 2023-07-01 22:20:09
>>michae+7k
I would scroll through my timeline, but it is now impossible to show you the good content.

Oftentimes the best posters are not the same people publishing the best stuff in their field, but sometimes they are. Aggregators are a different category.

What types of science are you interested in? Some random accounts that I can see right now:

@ShanuMathew93 - renewable energy tech and biz and news

@IdoTheThinking - California housing

@TheStalwart - finance, macroeconomics, microeconomics, etc.

@doctorveera - general genomics

◧◩◪◨
16. mkl+do[view] [source] [discussion] 2023-07-01 22:34:32
>>epista+Gi
Twitter threads seem awful for long form content. I have never seen long form content on Twitter that I could be sure I'd seen the way the author intended.
replies(1): >>phatfi+Hw
◧◩◪◨⬒
17. phatfi+Hw[view] [source] [discussion] 2023-07-01 23:50:47
>>mkl+do
As a light user of Twitter (and not at all at the moment, I don't have an account) the character limit for tweets felt like a good thing.

The tweet threads are not terrible, but are inconvenient enough for people to be as succinct as possible. Now there are walls of text from blue check marks that like the sound of their own voice far more than their content is insightful.

Sure I've read interesting long tweets, but I'd rather have a link to another site meant for long form writing than it living on Twitter, doubly so now as what bits of good content there were are behind a login wall.

But I get it, Elon needed something to make the blue check "worth it".

◧◩
18. brucet+OD[view] [source] [discussion] 2023-07-02 00:55:33
>>epista+0e
Well we should train LLMs from your old feed then!

I am only half kidding. "Profiles of specialized Twitter readers" would be an excellent dataset if it could somehow be filtered down to that.
