masswe+ (OP) 2023-07-08 07:30:55
May I suggest a more general "harvest.txt" covering all forms of content scraping?

Edit: Alternatively, have a "Harvest" section in "robots.txt", using the same established syntax and semantics. This has the advantage of making it clear that agents should fall back to the general "robots.txt" rules in the absence of any harvest-specific rules. Moreover, existing content management systems already provide means for maintaining "robots.txt", so there is no need to update them. (We might also introduce an "Index" section for the established purpose of "robots.txt", with any bare, untitled rules defaulting to it, thus providing backward compatibility.)

Example:

  #file "robots.txt"

  Index # optional section heading (maybe useful for switching context)
  User-agent: *
  Allow: /
  Disallow: /test/
  Disallow: /private/
  
  User-agent: Badbot
  Disallow: /
  
  Harvest # additional rules for scraping
  User-agent: *
  Disallow: /blog/
  Disallow: /protected-artwork/
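
For illustration, here's a minimal parsing sketch (Python) of how an agent might read such a sectioned file. Everything in it is an assumption drawn from the proposal above rather than any existing standard: the section names, the fallback to the bare "Index" rules, and the helper names parse_sections/rules_for are hypothetical.

  # Sketch only: parses the proposed sectioned robots.txt.
  # Section names and fallback semantics are assumptions from the
  # proposal above, not an existing standard.

  KNOWN_SECTIONS = {"index", "harvest"}

  def parse_sections(text):
      """Split a sectioned robots.txt into {section: [(field, value), ...]}.
      Bare rules before any heading default to "Index", which keeps
      plain robots.txt files working unchanged."""
      sections = {"index": []}
      current = "index"
      for raw in text.splitlines():
          line = raw.split("#", 1)[0].strip()  # drop comments/whitespace
          if not line:
              continue
          if line.lower() in KNOWN_SECTIONS:   # a section heading line
              current = line.lower()
              sections.setdefault(current, [])
          elif ":" in line:                    # an ordinary "Field: value" rule
              field, value = line.split(":", 1)
              sections[current].append((field.strip().lower(), value.strip()))
      return sections

  def rules_for(sections, purpose):
      """Rules for a purpose ("index" or "harvest"), falling back to
      the general "Index" rules when none are defined for it."""
      return sections.get(purpose) or sections["index"]

  if __name__ == "__main__":
      sample = ("Index\n"
                "User-agent: *\n"
                "Disallow: /private/\n"
                "\n"
                "Harvest\n"
                "User-agent: *\n"
                "Disallow: /blog/\n")
      print(rules_for(parse_sections(sample), "harvest"))
      # -> [('user-agent', '*'), ('disallow', '/blog/')]

A scraper would look up "harvest" and a search indexer "index"; on a legacy file with no headings, both collapse to the same bare rules, which is the compatibility point above.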