zlacker

1. margin+(OP) 2023-07-08 19:10:21
There are problems with robots.txt if you actually try to implement it for a crawler. Consider this scenario:

  Allow: /foo
  Disallow: /bar
Suppose /foo returns an HTTP 301 redirect to /bar, or returns a 200 with a header declaring /bar as its canonical location. Do you follow the redirect? Do you index /foo?
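To make the ambiguity concrete, here is a minimal sketch of one conservative policy: check every hop of the redirect chain against the rules and stop at the first disallowed one. It uses Python's standard `urllib.robotparser`; the `allowed_prefix` helper, the `examplebot` agent name, and the `example.com` URLs are made up for illustration, and it assumes every hop is on the same host (robots.txt is per-host). This is one possible answer to the questions above, not the correct one.

```python
from urllib.robotparser import RobotFileParser

# The rules from the scenario, with a User-agent line added so they parse.
ROBOTS_TXT = """\
User-agent: *
Allow: /foo
Disallow: /bar
"""


def build_parser(text):
    """Parse robots.txt text that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def allowed_prefix(rp, agent, hops):
    """Return the hops that may be fetched, stopping at the first disallowed one.

    Hypothetical helper: `hops` is the redirect chain in order, and all
    hops are assumed to be governed by the same robots.txt.
    """
    fetched = []
    for url in hops:
        if not rp.can_fetch(agent, url):
            break  # conservative choice: do not follow into a blocked path
        fetched.append(url)
    return fetched


rp = build_parser(ROBOTS_TXT)
chain = ["https://example.com/foo", "https://example.com/bar"]
print(allowed_prefix(rp, "examplebot", chain))
```

Under this policy the crawler fetches /foo, sees the redirect, and stops. Whether /foo should then be indexed as a page with no content is still an open choice.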

In practice it's also often a directory of the paths the website owners don't want eyes to look at. Pretty common to find a list of uncomfortable content, especially on larger websites... like that time the dean of the college praised the philanthropy of Boko Haram. Real OSINT footgun.

replies(1): >>Animat+Z41
2. Animat+Z41 2023-07-09 05:34:31
>>margin+(OP)
> There are problems with robots.txt if you actually try to implement it for a crawler.

Yes, although that's not what people are usually worried about.

I once tried to deal with that in Sitetruth's crawler. There are redirects at the HTTP level, redirects at the HTML level, and the HTTP->HTTPS thing. Resolving all that honestly is annoying, but possible. Sometimes you do need to look at the beginning of a file blocked by "robots.txt" to find that it is redirecting you elsewhere. It's like a door that says both "Keep Out" and "Please Use Other Door".
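A rough sketch of telling those three kinds of redirect apart, for illustration only (this is not Sitetruth's actual code). The `classify_redirect` and `MetaRefreshFinder` names are hypothetical, `headers` is assumed to be a plain dict with a `Location` key, and `body_start` is assumed to be the first chunk of the response body already decoded to text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        a = dict(attrs)
        if (a.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/elsewhere".
        _, _, rest = (a.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def classify_redirect(url, status, headers, body_start):
    """Return (kind, target) for one response, or (None, None) if no redirect."""
    location = headers.get("Location")
    if status in (301, 302, 303, 307, 308) and location:
        target = urljoin(url, location)
        src, dst = urlsplit(url), urlsplit(target)
        same_resource = (src.netloc, src.path or "/", src.query) == (
            dst.netloc, dst.path or "/", dst.query)
        if src.scheme == "http" and dst.scheme == "https" and same_resource:
            return "https-upgrade", target  # same page, only the scheme changed
        return "http", target
    # No HTTP-level redirect, so look at the start of the HTML itself.
    finder = MetaRefreshFinder()
    finder.feed(body_start)
    if finder.target:
        return "html", urljoin(url, finder.target)
    return None, None
```

Note that the HTML-level case only works if the crawler reads the beginning of the body, which is exactly the "look at the beginning of a blocked file" problem.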

This is more of a pedantic problem than a real one.
