1. berkle+(OP) 2023-07-02 15:41:45
> A real scraper would be stopped by a rate limit

I'm guessing you've never played an offensive or defensive role in scraping because what you've described is in no way a problem for a serious scraping effort. I agree the rate limits are stupid. They fuck over users, they stop amateur scrapers, and do nothing whatsoever to impede professional scraping.

If you want to stop most scraping, employ device attestation techniques and TLS fingerprinting.
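To make "TLS fingerprinting" concrete, here is a minimal sketch assuming a JA3-style scheme, which is one publicly documented approach (the thread does not say which scheme any given site uses). It hashes the fields a client offers in its ClientHello. The function names and the sample values in the usage note are illustrative, and parsing the handshake itself is out of scope.

```python
import hashlib

# GREASE values from RFC 8701: 0x0A0A, 0x1A1A, ..., 0xFAFA.
# JA3-style fingerprints drop them because clients pick them at random.
GREASE = {(b << 8) | b for b in range(0x0A, 0x100, 0x10)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3-style string from ClientHello fields."""

    def join(values):
        # Values within a field are decimal and joined with "-".
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3-style string, used as an identifier, not for security."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

For example, `ja3_hash(771, [0x1A1A, 4865, 4866], [0, 23, 10], [29, 23], [0])` returns the same digest no matter which GREASE value leads the cipher list. A server would compare that digest with the ones real browsers produce; other schemes may additionally treat the presence or absence of GREASE as a signal of its own.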

replies(1): >>costco+Wf
2. costco+Wf 2023-07-02 17:23:10
>>berkle+(OP)
But then you have to contend with this: https://github.com/bogdanfinn/tls-client... Just used this to bypass a Cloudflare check! I've never scraped Twitter, but Elon said there was a large scraping operation from Oracle IPs. He could substantially raise the cost of scraping by just banning datacenter IPs. Something like p0f would probably help too.

I pay for static residential proxies (basically servers running squid that somehow have IPs belonging to consumer ISPs). With TCP fingerprinting these would be detected as Linux, which would expose my Windows or iPhone user-agents as inconsistent, but I've never encountered a site that checks this. Or maybe sites are doing so silently and I just don't notice because I don't otherwise meet the bot threshold.
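The inconsistency check described here could look roughly like the sketch below. It assumes some passive fingerprinting tool has already produced a free-text OS label for the TCP connection; the label strings, the keyword matching, and the function names are all hypothetical, not the output format or API of p0f or any other tool.

```python
def classify_os(text: str) -> str:
    """Map a User-Agent string or a TCP fingerprint label to a coarse OS family."""
    t = text.lower()
    # Order matters: iPhone UAs contain "like Mac OS X", Android UAs contain "Linux".
    if "iphone" in t or "ipad" in t:
        return "ios"
    if "android" in t:
        return "linux"  # Android runs a Linux kernel, so its TCP stack looks like Linux
    if "windows" in t:
        return "windows"
    if "mac os x" in t:
        return "macos"
    if "linux" in t:
        return "linux"
    return "unknown"


def is_inconsistent(tcp_os_label: str, user_agent: str) -> bool:
    """True when the OS seen at the TCP layer contradicts the OS the UA claims."""
    tcp_os = classify_os(tcp_os_label)
    ua_os = classify_os(user_agent)
    if "unknown" in (tcp_os, ua_os):
        return False  # not enough information to call it a mismatch
    return tcp_os != ua_os
```

With a squid proxy on Linux in front, `is_inconsistent("Linux 3.11 and newer", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")` is `True`, while a Linux or Android user-agent through the same proxy passes. A real deployment would more likely feed this into a bot score than block on it outright.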
replies(1): >>berkle+pt
3. berkle+pt 2023-07-02 18:39:36
>>costco+Wf
For sure, using a custom TLS library like uTLS helps -- you need to inject that GREASE cipher selection. I have a suspicion that private residential proxies are out of budget for many outfits, or the IPs are too few and then simple rate limiting kicks in? Who do you use, if you're willing to share? I've not been happy with the, uhh, questionable ethics of Luminati/BrightData in the past.

There are definitely more and more sites doing TLS/TCP/etc fingerprinting or device attestation for mobile APIs, but it's still pretty rare. I mean Twitter is trying to limit requests by IP, so definitely amateur hour over there.
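A rough sketch of why per-IP limiting alone does little against a proxy pool: the same request volume sails through once it is spread over enough addresses. The fixed-window limiter, the limit of 100 requests per minute, and the pool sizes are all made-up numbers for illustration, not anything Twitter is known to use.

```python
from collections import defaultdict


class PerIpRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (ip, window index) -> requests seen

    def allow(self, ip: str, now: float) -> bool:
        key = (ip, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def count_allowed(limiter, ips, total_requests, now=0.0):
    """Send `total_requests` round-robin across `ips`; return how many got through."""
    allowed = 0
    for i in range(total_requests):
        if limiter.allow(ips[i % len(ips)], now):
            allowed += 1
    return allowed


if __name__ == "__main__":
    single = ["203.0.113.1"]
    pool = [f"203.0.113.{n}" for n in range(1, 11)]  # hypothetical 10-IP pool
    print(count_allowed(PerIpRateLimiter(100, 60), single, 1000))
    print(count_allowed(PerIpRateLimiter(100, 60), pool, 1000))
```

One address gets cut off at the limit, while ten addresses push the full 1000 requests through in the same window. The legitimate user behind a single IP is the one who hits the wall.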

replies(1): >>costco+hD
4. costco+hD 2023-07-02 19:44:38
>>berkle+pt
I use https://www.pingproxies.com/isp which is like $3/IP/month with unlimited bandwidth (I assume if you used a ridiculous amount they might charge you). Luminati pricing is extortionate. I have no idea how anyone doing anything at scale can afford $10/GB. I haven't investigated, so I don't know whether Twitter's limits are per account or per IP.
replies(1): >>berkle+lE
5. berkle+lE 2023-07-02 19:52:36
>>costco+hD
Seriously. I don't even consider a provider if they want to charge for bandwidth. I'm doing about 50 TB/mo atm.
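For a sense of scale, a back-of-the-envelope comparison using the prices quoted in this thread ($10/GB metered versus $3/IP/month flat) at 50 TB/month. The pool size of 100 IPs is an assumption (nobody here stated one), 1 TB is taken as 1000 GB, and whether a flat-rate provider would actually tolerate that volume is an open question, as noted upthread.

```python
def metered_cost_usd(tb_per_month: int, usd_per_gb: int, gb_per_tb: int = 1000) -> int:
    """Monthly cost when the provider bills per GB transferred."""
    return tb_per_month * gb_per_tb * usd_per_gb


def flat_cost_usd(ip_count: int, usd_per_ip_month: int) -> int:
    """Monthly cost when the provider bills per IP with unmetered bandwidth."""
    return ip_count * usd_per_ip_month


if __name__ == "__main__":
    metered = metered_cost_usd(tb_per_month=50, usd_per_gb=10)
    flat = flat_cost_usd(ip_count=100, usd_per_ip_month=3)  # 100 IPs is hypothetical
    print(f"metered: ${metered:,}/month, flat: ${flat:,}/month")
```

At those rates the metered bill comes to $500,000/month against $300/month for the assumed flat-rate pool, which is why per-GB pricing is a non-starter at this volume.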