zlacker

Thread: "Google to explore alternatives to robots.txt"
1. Kwpols+ab 2023-07-08 07:51:20
>>skille+(OP)
Why would AI need a new standard for excluding it? Just add a "Googlebot-AI" user agent to your list [0] and respect these rules when crawling content for use in AIs, and convince OpenAI and Bing to do the same.

[0] https://developers.google.com/search/docs/crawling-indexing/...
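To make the suggestion above concrete, here is a minimal sketch of how a crawler could check a separate AI user agent against an ordinary robots.txt, using Python's standard `urllib.robotparser`. The "Googlebot-AI" token is the name proposed in this comment, not a real crawler name, and the robots.txt content and example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "Googlebot-AI" is the token proposed in the
# comment above, not an actual crawler name. Paths are illustrative.
ROBOTS_TXT = """\
User-agent: Googlebot-AI
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/articles/post.html"

# The regular search crawler falls through to the "*" group.
search_allowed = parser.can_fetch("Googlebot", page)

# The (hypothetical) AI crawler matches its own group and is excluded.
ai_allowed = parser.can_fetch("Googlebot-AI", page)

print(f"search crawl allowed: {search_allowed}")
print(f"AI training crawl allowed: {ai_allowed}")
```

One caveat with this particular parser: it matches user agents by substring, so a group for plain "Googlebot" would also apply to "Googlebot-AI" unless the more specific group is listed first.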

2. bastaw+CL 2023-07-08 14:07:42
>>Kwpols+ab
I have no insight, but I suspect it's a question of context: regular old search is about whether a page is indexed or not. Either a URL is part of the index or it isn't. But with AI, there are important questions about what's in those URLs.

I think Google is probably thinking hard about the problem of training AI: you don't want to train on the output of other AI. That doesn't mean the content shouldn't be processed, just that it shouldn't be used for training. Or maybe it's worth noting that some content is derived from other content that you've manually produced, versus content derived from the content of third parties.

Said another way, I expect that Google isn't just implementing a new allowlist/denylist. It's likely about exposing new information about content.

3. 2OEH8e+n41 2023-07-08 16:10:00
>>bastaw+CL
Cool. Sounds like a you problem, you meaning crawlers and AI trainers. Now it will fall on every web developer to tag their data for it to be exploited by megacorps?

Now that I think of it, why do we put up with robots.txt at all?

> A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests

If someone overloads your site with automated requests, how is that not criminal? Why aren't they liable?

4. dasil0+x71 2023-07-08 16:26:57
>>2OEH8e+n41
I don't understand what you have against robots.txt. It's just a way to signal what you want crawlers to do on your site. It's not complicated or mandatory, but it gives you a way to influence how your site is accessed. I'm not sure why you would jump straight to litigation as a better solution—that solves a much smaller set of problems at a much higher cost.
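For what "signaling what you want crawlers to do" can look like from the crawler's side, here is a rough sketch of a cooperative crawler that skips disallowed paths and spaces out its requests. It assumes the site publishes a `Crawl-delay` line, which is a non-standard extension that not every crawler honors. The `plan_fetches` helper, the "ExampleBot" agent name, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt. Crawl-delay is a non-standard directive;
# whether a given crawler respects it is up to that crawler.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /search
"""


def plan_fetches(parser, agent, urls, default_delay=1.0):
    """Return (seconds_from_start, url) pairs for the URLs the agent may fetch.

    Hypothetical helper: it only builds a schedule and never touches
    the network.
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        # No Crawl-delay published for this agent: use our own default.
        delay = default_delay
    schedule = []
    offset = 0.0
    for url in urls:
        if parser.can_fetch(agent, url):
            schedule.append((offset, url))
            offset += delay
    return schedule


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/a",
    "https://example.com/search/results",
    "https://example.com/b",
]
schedule = plan_fetches(parser, "ExampleBot", urls)
for seconds, url in schedule:
    print(f"t+{seconds:.0f}s  {url}")
```

Nothing here is enforced by the site; the file only works to the extent that the crawler chooses to read it and comply.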