1. shaneb+(OP) 2023-05-10 13:01:25
"Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare."

I like the idea of "ai.txt" but those who eat resources rarely listen to ToS. Frankly, I serve 503s to all identifiable bots, unless they are on my explicit allow list.
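
A minimal sketch of the allow-list approach described above, assuming bots are identified by User-Agent substrings. The marker list, the allow list, and the function name `status_for` are hypothetical illustrations, not the poster's actual setup:

```python
# Hypothetical marker substrings and allow list; adjust to your own traffic.
BOT_MARKERS = ("bot", "crawler", "spider", "scrapy", "python-requests", "curl")
ALLOWED_BOTS = ("googlebot", "bingbot")


def status_for(user_agent: str) -> int:
    """Return the HTTP status to send for a given User-Agent header."""
    ua = (user_agent or "").lower()
    # Explicitly allowed bots are checked first, so they always get through.
    if any(name in ua for name in ALLOWED_BOTS):
        return 200
    # Any other identifiable bot gets a cheap 503 with no body.
    if any(marker in ua for marker in BOT_MARKERS):
        return 503
    # Everything else is treated as a normal visitor.
    return 200
```

User-Agent strings are trivially spoofed, so this only catches bots that identify themselves, which matches the "identifiable bots" wording above.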

replies(2): >>always+p1 >>spc476+A31
2. always+p1 2023-05-10 13:08:25
>>shaneb+(OP)
Why not serve fake garbage indistinguishable from real content by a computer, like LLM output? Sending errors just incentivizes bot owners to fix the identifiable parts
replies(4): >>shaneb+r5 >>twelve+86 >>ape4+f8 >>dspill+7b
3. shaneb+r5 2023-05-10 13:28:25
>>always+p1
"Why not serve fake garbage indistinguishable from real content by a computer, like LLM output?"

Serving more than the minimum wastes resources. Worse yet, a better solution would cost my time.

"Sending errors just incentivizes bot owners to fix the identifiable parts"

Sure, someone could make or configure their scraper perfectly. "Perfect" is now the table stakes though.

Edit:

My solution aims to make circumvention disproportionately expensive. I want a 10x return on my time.

4. twelve+86 2023-05-10 13:31:14
>>always+p1
It'd be cool to be able to fingerprint that garbage, too. Like, sprinkle some hashes here and there (or something like that) so that you can later uniquely look up your own "content" being stolen by chatbots, and by which ones.
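
One way the "sprinkle some hashes" idea could look, as a rough sketch: derive a short keyed hash per requester and embed it in the text served to that requester. The secret, the requester IDs, and all function names here are hypothetical:

```python
import hashlib
import hmac

# Hypothetical private key; anyone who knows it can recompute the tokens.
SECRET = b"replace-with-a-private-key"


def canary_token(requester_id: str, length: int = 12) -> str:
    """Derive a short hex token that is unique to one requester."""
    digest = hmac.new(SECRET, requester_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def tag_text(text: str, requester_id: str) -> str:
    """Append the requester's token to the text served to them."""
    return f"{text} (ref {canary_token(requester_id)})"


def find_source(sample: str, known_requesters: list) -> list:
    """Return the requesters whose token appears in a text sample."""
    return [r for r in known_requesters if canary_token(r) in sample]
```

Searching later output for a token then points back to whoever was served it. This assumes the token survives verbatim, which is far from guaranteed once text has passed through model training.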
replies(1): >>shaneb+f7
5. shaneb+f7 2023-05-10 13:36:55
>>twelve+86
You can. I can't think of the appropriate term though. Hopefully someone else chimes in here with a link.
6. ape4+f8 2023-05-10 13:41:22
>>always+p1
I like this idea. Of course it would have to be only to robots that visit a page disallowed by the robots.txt
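
A small sketch of that check using Python's standard `urllib.robotparser`, assuming a hypothetical robots.txt with a `/trap/` path that no compliant robot should ever request:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /trap/ is an illustrative path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def is_disallowed(path: str, user_agent: str = "*") -> bool:
    """True if robots.txt forbids this user agent from fetching the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, path)
```

A request for a path where `is_disallowed` returns True comes from a client that either ignored robots.txt or never read it, so it is a candidate for the garbage response.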
7. dspill+7b 2023-05-10 13:55:12
>>always+p1
> Sending errors just incentivizes bot owners to fix the identifiable parts

Nah. It'll just make them fake their identity so it is harder to tell the traffic is from a bot.

8. spc476+A31 2023-05-10 17:47:19
>>shaneb+(OP)
It might be a better idea to serve up a 418 ("I'm a teapot") with a one-line text file saying "I'm not an HTTP server". That solved a problem I had with bots making HTTP requests to my gopher server [1]. Serving up a 503 informs the bot that there's a server issue and it may try again later. A 418 informs the bot that it made an erroneous request, and such an odd error code might get someone to look into it and stop.

[1] https://boston.conman.org/2019/09/30.2
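
For illustration only, a minimal sketch of a 418 responder built on Python's standard `http.server`. This is not the gopher-server code from the linked post; the handler name, `serve` helper, and port are hypothetical:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# The one-line body described above.
BODY = b"I'm not an HTTP server\n"


class TeapotHandler(BaseHTTPRequestHandler):
    """Answer every GET with 418 and a one-line plain-text body."""

    def do_GET(self):
        self.send_response(418, "I'm a teapot")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)


def serve(port: int = 8080) -> None:
    """Run the responder until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), TeapotHandler).serve_forever()
```

Calling `serve()` starts the listener. In practice the 418 would probably be returned only for requests already classified as unwanted bots, not for every visitor.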

replies(1): >>shaneb+0w7
9. shaneb+0w7 2023-05-12 13:01:36
>>spc476+A31
This is very interesting. I've bookmarked the link. Thanks for sharing. I believe minimal is best and this might fit nicely within my larger system. Do you approach other problems with a similar mindset?