
> There are problems with robots.txt if you actually try to implement it for a crawler.

Yes, although that's not what people are usually worried about.

I once tried to deal with that in Sitetruth's crawler. There are redirects at the HTTP level (3xx responses), redirects at the HTML level (meta refresh), and the HTTP-to-HTTPS upgrade. Resolving all of that honestly is annoying, but possible. Sometimes you do need to look at the beginning of a file blocked by robots.txt to find out that it is redirecting you elsewhere. It's like a door that says both "Keep Out" and "Please Use Other Door".
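To make the "other door" case concrete, here is a rough Python sketch of one way a crawler could handle it. This is not Sitetruth's actual code. The names `next_hop`, `resolve`, `PAGES`, and `ExampleBot` are made up for illustration, the fetcher is a dict lookup standing in for real HTTP, and the policy (peek at a blocked URL only to see where it redirects, then apply robots.txt to the final URL) is an assumption. It also pretends one robots.txt covers both the http and https origins; a real crawler would need one per scheme, host, and port.

```python
import re
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Simplified matcher for <meta http-equiv="refresh" content="0; url=...">.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)
REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_hop(url, status, headers, body):
    """Return the redirect target of one response, or None if it is final."""
    # HTTP-level redirect (this also covers the http -> https upgrade).
    if status in REDIRECT_CODES and "Location" in headers:
        return urljoin(url, headers["Location"])
    # HTML-level redirect: only the beginning of the file is inspected.
    match = META_REFRESH.search(body[:4096])
    return urljoin(url, match.group(1)) if match else None


def resolve(start, fetch, robots, agent="ExampleBot", max_hops=10):
    """Follow redirects from start; return (final_url, allowed_by_robots)."""
    url, seen = start, set()
    # Stop on a redirect loop or after max_hops fetches.
    while url not in seen and len(seen) < max_hops:
        seen.add(url)
        target = next_hop(url, *fetch(url))
        if target is None:
            break
        url = target
    # Assumed policy: robots.txt is applied to where we ended up.
    return url, robots.can_fetch(agent, url)


# Hypothetical site: /old is disallowed, but it only points at /new.
PAGES = {
    "http://example.com/old": (301, {"Location": "https://example.com/old"}, ""),
    "https://example.com/old": (
        200, {}, '<meta http-equiv="refresh" content="0; url=/new">'),
    "https://example.com/new": (200, {}, "<p>content</p>"),
}
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /old"])

final_url, allowed = resolve("http://example.com/old", PAGES.__getitem__, robots)
```

Here the starting URL is disallowed, yet the chain ends at a URL that is allowed, which is exactly the "Keep Out" / "Please Use Other Door" situation. A stricter crawler could refuse at the first blocked hop instead and never see the redirect.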

This is more of a pedantic problem than a real one.