zlacker

[parent] [thread] 3 comments
1. usrusr+(OP)[view] [source] 2023-05-10 13:57:30
That's still not an argument to introduce ai.txt, because everything a hypothetical ai.txt could ever do is already done just as well (or not) by the robots.txt we have. If a training data crawler ignores robots.txt, it won't bother checking for an ai.txt either.

And if you feel like rolling out the "welcome friend!" doormat to a particular training data crawler, you are free to dedicate as detailed a robots.txt block as you like to its user agent header of choice. No new conventions needed, everything is already in place.
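For example, a sketch of such a per-crawler robots.txt (the "ExampleTrainingBot" token is a placeholder, not a real crawler's user agent; Googlebot is the real search indexer token):

```
# robots.txt: welcome one crawler, turn away another, per user agent.
# Hypothetical training crawler gets nothing:
User-agent: ExampleTrainingBot
Disallow: /

# Search indexing crawler may fetch everything:
User-agent: Googlebot
Disallow:
```

An empty Disallow line means "nothing is disallowed", i.e. full access, so no new directive is needed to express either policy.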

replies(3): >>michae+Ng >>irobet+6l >>joshua+xo1
2. michae+Ng[view] [source] 2023-05-10 15:08:53
>>usrusr+(OP)
This seems to be assuming a very different purpose for ai.txt than the OP proposed. It sounds like they are intending ai.txt to give useful contextual information to crawlers collecting AI training data. Robots.txt does not have any of this information (although I suppose you could include it in comments).
3. irobet+6l[view] [source] 2023-05-10 15:24:57
>>usrusr+(OP)
Worse, ai.txt could become an adversarial vector: a way to trick the AI into filing your information under some semantic concept of your choosing.
4. joshua+xo1[view] [source] 2023-05-10 20:11:28
>>usrusr+(OP)
I do think that robots.txt is pretty useful. If I want my content indexed, I can help the engine find it. If indexing my content is counterproductive, I can ask that it be skipped. So it helps align my interests with the search engine's: I can expose my content, or I can help the engine avoid wasting resources indexing something I don't want it to see.

It would also be useful to distinguish training crawlers from indexing crawlers. Maybe I'm publishing personal content. It's useful for me to have it indexed for search, but I don't want an AI to be able to simulate me or my style.
