zlacker

[parent] [thread] 15 comments
1. brooks+(OP)[view] [source] 2023-05-10 13:10:27
> Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare.

Failing to solve every problem does not mean a solution is a failure.

From sunscreen to seatbelts, the world is full of great solutions that occasionally fail due to statistics and large numbers.

replies(4): >>samwil+e1 >>bileka+m8 >>usrusr+W9 >>vlunkr+2n
2. samwil+e1[view] [source] 2023-05-10 13:16:31
>>brooks+(OP)
Ok, fair point, I may be being a little hyperbolic. But my point is that it's not a system that we should copy for preventing the use of content in training AI. It would become a useless distraction.

If you "violate" a robots.txt, the server administrator can choose to block your bot (if they can fingerprint it) or your IP (if it's static).

With an ai.txt there is no potential downside to violating it - unless we get new legislation giving it legal standing. The nature of ML models is that it's opaque exactly what content they were trained on, so there is no obvious retaliation or retribution.

replies(4): >>Wowfun+x3 >>Burnin+d4 >>capabl+m4 >>jefftk+L5
3. Wowfun+x3[view] [source] [discussion] 2023-05-10 13:28:10
>>samwil+e1
> But my point is that it's not a system that we should copy for preventing the use of content in training AI.

I don't think that's what OP is envisioning based on their post!

4. Burnin+d4[view] [source] [discussion] 2023-05-10 13:30:59
>>samwil+e1
OP is trying to give helpful info to the AI, not set boundaries for it.
5. capabl+m4[view] [source] [discussion] 2023-05-10 13:31:40
>>samwil+e1
> But my point is that it's not a system that we should copy for preventing the use of content in training AI

The purpose OP is suggesting in the submission is the opposite: to help AI crawlers understand what the page/website is about without having to infer the purpose from the content itself.

replies(1): >>Xelyne+t8
6. jefftk+L5[view] [source] [discussion] 2023-05-10 13:38:33
>>samwil+e1
> It's not a system that we should copy for preventing the use of content in training AI

I don't see the OP saying anything about "ai.txt" being for that? They're advocating it as a way that AIs could use fewer tokens to understand what a site is about.

(Which I also don't think is a good idea, since we already have lots of ways of including structured metadata in pages, but the main problem is not that crawlers would ignore it.)
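For instance (the content here is invented for illustration), a page can already carry both a human-readable summary and machine-readable structured data:

```html
<head>
  <!-- Human-readable summary, also shown in search results -->
  <meta name="description" content="Field notes on vintage synth repair.">
  <!-- Machine-readable structured data (schema.org vocabulary via JSON-LD) -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Blog",
    "name": "Synth Repair Notes",
    "about": "Repair and restoration of vintage synthesizers"
  }
  </script>
</head>
```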

replies(1): >>kmoser+DC
7. bileka+m8[view] [source] 2023-05-10 13:50:30
>>brooks+(OP)
> Failing to solve every problem does not mean a solution is a failure.

There is something to be said, though, for OP's point that it may actually be better to do nothing than to have an ai.txt, because it can give a false sense of security, which is obviously not what you want.

replies(1): >>lelant+tP
8. Xelyne+t8[view] [source] [discussion] 2023-05-10 13:51:25
>>capabl+m4
Isn't that the entire point of the semantic web?
replies(1): >>kmoser+5D
9. usrusr+W9[view] [source] 2023-05-10 13:57:30
>>brooks+(OP)
That's still not an argument to introduce ai.txt, because everything a hypothetical ai.txt could ever do is already done just as well (or not) by the robots.txt we have. If a training data crawler ignores robots.txt, it won't bother checking for an ai.txt either.

And if you feel like rolling out the "welcome friend!" doormat to a particular training data crawler, you are free to dedicate as detailed a robots.txt block as you like to its user agent header of choice. No new conventions needed, everything is already in place.
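A sketch of how that looks from a well-behaved crawler's side, using Python's standard-library robotparser (the user-agent names here are made up):

```python
from urllib import robotparser

# A robots.txt that welcomes one specific crawler while
# keeping /private/ off-limits to everyone else.
ROBOTS_TXT = """\
User-agent: FriendlyAIBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The welcomed crawler may fetch anything:
print(rp.can_fetch("FriendlyAIBot", "https://example.com/private/page"))  # True
# Everyone else is kept out of /private/:
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/page"))   # False
```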

replies(3): >>michae+Jq >>irobet+2v >>joshua+ty1
10. vlunkr+2n[view] [source] 2023-05-10 14:53:10
>>brooks+(OP)
I know it's getting pedantic, but sunscreen and seatbelts are a poor analogy. They do offer protection if you use them. robots.txt only offers protection if other people/robots choose to care about them.
11. michae+Jq[view] [source] [discussion] 2023-05-10 15:08:53
>>usrusr+W9
This seems to be assuming a very different purpose for ai.txt than the OP proposed. It sounds like they are intending ai.txt to give useful contextual information to crawlers collecting AI training data. Robots.txt does not have any of this information (although I suppose you could include it in comments).
12. irobet+2v[view] [source] [discussion] 2023-05-10 15:24:57
>>usrusr+W9
Worse, ai.txt could become an adversarial vector: an attempt to trick the AI into filing your information under some semantic concept.
13. kmoser+DC[view] [source] [discussion] 2023-05-10 15:57:03
>>jefftk+L5
Not only do we already have lots of ways of including structured metadata, but if you want to include directives about what should/shouldn't be scraped and by whom, we already have robots.txt.

In other words, there's no need to create an ai.txt when the robots.txt standard can just be extended.

14. kmoser+5D[view] [source] [discussion] 2023-05-10 15:59:01
>>Xelyne+t8
If only there was an HTML tag that let you provide a concise description of the page content. Perhaps something like <meta name="description" content="This is an example of a meta description. This will often show up in search results.">
15. lelant+tP[view] [source] [discussion] 2023-05-10 16:53:27
>>bileka+m8
The point of an ai.txt is that it signals intention of the copyright holder.

Anytime a business is caught using that content, they can't claim that they used publicly available information, because the ai.txt specifically signalled to everyone, in a clear and unambiguous manner, that the license granted by viewing the page is withheld from AI training.

16. joshua+ty1[view] [source] [discussion] 2023-05-10 20:11:28
>>usrusr+W9
I do think that robots.txt is pretty useful. If I want my content indexed, I can help the engine find my content. If indexing my content is counterproductive, then I can ask that it be skipped. So it helps align my interests with the search engine's; I can expose my content, or I can help the engine avoid wasting resources indexing something that I don't want it to see.

It would also be useful to distinguish training crawlers from indexing crawlers. Maybe I'm publishing personal content. It's useful for me to have it indexed for search, but I don't want an AI to be able to simulate me or my style.
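Something like this would already express that split, provided training crawlers use distinct user-agent tokens (GPTBot is OpenAI's published training-crawler token, Googlebot is Google's search indexer; the comments are mine):

```
# Search indexing: go ahead.
User-agent: Googlebot
Allow: /

# AI training: please don't.
User-agent: GPTBot
Disallow: /
```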
