zlacker

1. margin+(OP) 2022-07-08 21:04:35
A while back I changed my search engine's crawl data to be ZSTD-compressed JSON. It's a bit finicky to work with, but I'm beginning to realize just how powerful this is.

Could literally just do

  find -name \*.zstd -exec zstdcat {} \; |
    jq 'first(select(.doc|select(.!=null)|.[].headers|select(.!=null)|test("[xX]-[aA]dblock-[kK]ey")))'
and it spewed out samples of domains with a header like X-Adblock-Key. (I'm not great with jq, so there's probably a better way of doing this, but this unga bunga approach works too.)
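
One possibly tidier way to write the same filter, as a rough sketch rather than a drop-in replacement. It assumes what the filter above implies: each record has a `.doc` collection (or null) whose entries carry `.headers` as a string (or null). The function name `adblock_domains` is made up for illustration.

```shell
# Hypothetical helper: reads JSON records on stdin and prints (compactly)
# every record where at least one document's headers mention X-Adblock-Key.
# Assumes .doc is a list/object or null, and .headers is a string or null.
adblock_domains() {
  # .doc[]?        : iterate documents, silently yielding nothing if .doc is null
  # .headers // "" : treat a missing/null headers field as an empty string
  # test(re; "i")  : case-insensitive regex match
  jq -c 'select(any(.doc[]?; (.headers // "") | test("x-adblock-key"; "i")))'
}

# Usage over the crawl data (one zstdcat call for many files via "+"):
#   find . -name '*.zstd' -exec zstdcat {} + | adblock_domains
```

Using `any(...)` emits each matching record once, which is what the `first(select(...))` wrapper was doing, and the `"i"` flag replaces the `[xX]`-style character classes.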

Specifically, today I did some research on a few tags and headers supposedly associated with "Acceptable Ads" (a standard for showing ads through complicit adblockers), and ended up with a fairly reliable fingerprint for a network of domain squatters that have been a nuisance in my search engine database. Turns out they're basically the only ones that use the headers and tags I was looking at, so now I'm onto their IP ranges as well.

replies(2): >>higero+8d >>Bonobo+1w
2. higero+8d 2022-07-08 21:57:31
>>margin+(OP)
I don't have much context about your technical requirements, but can I ask why JSON instead of a more indexable format?
replies(1): >>margin+3k
3. margin+3k 2022-07-08 22:21:54
>>higero+8d
It's a tradeoff between ease of writing, ease of reading for indexing, and freeform analytical use cases like this. JSON caters to all of them fairly well.

It's one file per domain, so looking at specific URLs is no prob with this setup.

4. Bonobo+1w 2022-07-08 23:12:14
>>margin+(OP)
So the domain squatters pay to let adblockers show ads on their sites?
replies(1): >>margin+Iz
5. margin+Iz 2022-07-08 23:28:30
>>Bonobo+1w
Yeah, appears to be something like that. Very convenient for me.