zlacker

[parent] [thread] 13 comments
1. pschue+(OP)[view] [source] 2023-07-02 01:36:15
This. I'd bet substantial amounts of money that the evil scraper idea is the result of a) another issue + b) paranoia + c) Musk thinking he understands better than anybody else.
replies(1): >>berkle+56
2. berkle+56[view] [source] 2023-07-02 02:42:35
>>pschue+(OP)
Dismissing scrapers out of hand is a really ignorant take. LLMs are trained on petabytes of conversational data. Scraping is how OpenAI built GPT's training set, and it's how all the copycats are trying to do the same.

Elon can be a monumental asshat, he can be self-DDoS'ing, and he can be accurate about scraping, all at the same time. Scraping is why every single social media platform is heading toward becoming a walled garden.

replies(3): >>pschue+Y6 >>gmerc+sc >>dyno12+qk1
3. pschue+Y6[view] [source] [discussion] 2023-07-02 02:49:35
>>berkle+56
I'm not denying that scrapers exist, I'm just highly suspicious of this explanation given that: a) he's proven time and time again how willing he is to say shit just to get attention, b) he doesn't seem to understand software very well, and c) if shit was imploding for reasons related to decisions he made, this is precisely the kind of blame externalization I would expect.
replies(1): >>evan_+H7
4. evan_+H7[view] [source] [discussion] 2023-07-02 02:57:50
>>pschue+Y6
Yeah, scrapers have always existed, and while their traffic is undoubtedly higher than it has been in the past, it can't possibly amount to anything significant compared to the rest of the traffic hitting the site.

A real scraper would be stopped by a rate limit set to, like, 100 tweets/minute. 600 tweets/day is a completely pointless, punitive limit.
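
Back-of-the-envelope, in case the gap isn't obvious (the 600/day figure is from the announcement; the rest is just arithmetic):

    # Compare a scraper-tolerant limit to the announced cap.
    generous_limit = 100 * 60 * 24   # 100 tweets/min, all day = 144,000/day
    announced_cap = 600              # the 600 tweets/day cap from the thread

    print(generous_limit)                   # 144000
    print(generous_limit // announced_cap)  # the cap is 240x stricter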

replies(2): >>manque+UL >>berkle+cm1
5. gmerc+sc[view] [source] [discussion] 2023-07-02 03:53:41
>>berkle+56
It’s quite ignorant to assume petabytes of garbage have any value at this point. See the Chinchilla paper.
replies(1): >>berkle+jp
6. berkle+jp[view] [source] [discussion] 2023-07-02 06:36:13
>>gmerc+sc
I agree, but there are hundreds if not thousands of AI startups trying to make their own relevant LLM, and they're going to be scraping Twitter. The Onion called it many years ago [1]: "400 billion tweets and not one useful bit of data was ever transmitted".

[1] https://www.youtube.com/watch?v=cqggW08BWO0&t=138s

replies(1): >>rightb+rA
7. rightb+rA[view] [source] [discussion] 2023-07-02 08:39:22
>>berkle+jp
I can't imagine worse training data than e.g. Twitter and Reddit posts. How about like, dunno, books?

Edit: Ah, nvm, if you are trying to do a chat bot it is essentially what you want.

8. manque+UL[view] [source] [discussion] 2023-07-02 10:57:49
>>evan_+H7
Depends on how they scrape.

100 tweets/minute is hardly a deterrent for a botnet using compromised devices on non-datacenter IPs.
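
To put numbers on it (the botnet size is an illustrative assumption, not a measurement):

    # Per-IP limits multiply across a distributed botnet.
    per_ip_limit = 100      # tweets/minute allowed per IP
    botnet_size = 10_000    # assumed number of compromised residential devices

    print(per_ip_limit * botnet_size)  # 1,000,000 tweets/minute in aggregate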

9. dyno12+qk1[view] [source] [discussion] 2023-07-02 15:33:14
>>berkle+56
How do you feel about search engine indexing?
10. berkle+cm1[view] [source] [discussion] 2023-07-02 15:41:45
>>evan_+H7
> A real scraper would be stopped by a rate limit

I'm guessing you've never played an offensive or defensive role in scraping, because what you've described is in no way a problem for a serious scraping effort. I agree the rate limits are stupid: they fuck over users, they stop amateur scrapers, and they do nothing whatsoever to impede professional scraping.

If you want to stop most scraping, employ device attestation techniques and TLS fingerprinting.
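
For a concrete idea of what TLS fingerprinting looks like, here's a minimal JA3-style sketch (JA3 is one common scheme, not a claim about what any particular site runs; the ClientHello values below are hypothetical):

    import hashlib

    def ja3(version, ciphers, extensions, curves, point_formats):
        # JA3 concatenates ClientHello fields and MD5s them. Stock HTTP
        # libraries produce stable, well-known hashes that are easy to block.
        fields = [
            str(version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Hypothetical ClientHello values for illustration:
    print(ja3(771, [4865, 4866, 4867], [0, 11, 10], [29, 23, 24], [0]))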

replies(1): >>costco+8C1
11. costco+8C1[view] [source] [discussion] 2023-07-02 17:23:10
>>berkle+cm1
But then you have to contend with this: https://github.com/bogdanfinn/tls-client... Just used this to bypass a Cloudflare check!

I've never scraped Twitter, but Elon said there was a large scraping operation coming from Oracle IPs. He could substantially raise the cost of scraping just by banning datacenter IPs. Something like p0f would probably help too.

I pay for static residential proxies (basically servers running squid that somehow have IPs belonging to consumer ISPs), and with TCP fingerprinting those would be detected as Linux, exposing my Windows or iPhone user-agents as inconsistent. But I've never encountered a site that checks this. Although maybe sites are doing so silently and I just don't notice because I don't otherwise meet the bot threshold.
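
That consistency check is simple to sketch: compare the OS that passive TCP fingerprinting (p0f) reports against the OS the User-Agent claims. The mapping below is a toy assumption:

    def os_from_user_agent(ua: str) -> str:
        ua = ua.lower()
        if "windows" in ua:
            return "windows"
        if "iphone" in ua or "mac os" in ua:
            return "macos/ios"
        if "android" in ua or "linux" in ua:
            return "linux"
        return "unknown"

    def inconsistent(p0f_os: str, ua: str) -> bool:
        # Flag a Linux proxy box forwarding a Windows/iPhone user-agent.
        claimed = os_from_user_agent(ua)
        return claimed != "unknown" and claimed != p0f_os.lower()

    print(inconsistent("linux", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # True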
replies(1): >>berkle+BP1
12. berkle+BP1[view] [source] [discussion] 2023-07-02 18:39:36
>>costco+8C1
For sure, using a custom TLS library like uTLS helps -- you need to inject that GREASE cipher selection. I suspect private residential proxies are out of budget for many outfits, or the IP pools are too small and then simple rate limiting kicks in. Who do you use, if you're willing to share? I've not been happy with the, uhh, questionable ethics of Luminati/BrightData in the past.
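
For reference, the GREASE values in question are the 16 reserved code points from RFC 8701 (0x0a0a, 0x1a1a, ..., 0xfafa) that Chrome sprinkles into its cipher and extension lists. A rough sketch of the idea (illustrative only, not uTLS's actual API):

    import random

    # RFC 8701 GREASE values: both bytes are 0xNA, N = 0..F.
    GREASE = [(b << 8) | b for b in range(0x0A, 0xFB, 0x10)]

    def with_grease(ciphers: list[int]) -> list[int]:
        # Chrome prepends a random GREASE value to its cipher list,
        # so a browser-mimicking client has to do the same.
        return [random.choice(GREASE)] + ciphers

    print([hex(c) for c in with_grease([0x1301, 0x1302, 0x1303])])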

There are definitely more and more sites doing TLS/TCP/etc. fingerprinting or device attestation for mobile APIs, but it's still pretty rare. I mean, Twitter is trying to limit requests by IP, so it's definitely amateur hour over there.

replies(1): >>costco+tZ1
13. costco+tZ1[view] [source] [discussion] 2023-07-02 19:44:38
>>berkle+BP1
I use https://www.pingproxies.com/isp which is like $3/IP/month with unlimited bandwidth (I assume if you used a ridiculous amount they might charge you). Luminati pricing is extortionate; I have no idea how anyone doing anything at scale can afford $10/GB. I haven't investigated whether Twitter's limits are per account or per IP.
replies(1): >>berkle+x02
14. berkle+x02[view] [source] [discussion] 2023-07-02 19:52:36
>>costco+tZ1
Seriously. I don't even consider a provider if they want to charge for bandwidth. I'm doing about 50 TB/mo atm.
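
For scale, with the numbers from this thread (50 TB/mo against $10/GB, versus flat-rate IPs; the IP count is an assumption):

    tb_per_month = 50
    per_gb_cost = tb_per_month * 1000 * 10   # $10/GB -> $500,000/month
    flat_cost = 100 * 3                      # assumed 100 static IPs at $3/IP/mo

    print(f"per-GB: ${per_gb_cost:,}/mo vs flat-rate: ${flat_cost}/mo")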