zlacker

I have no insight, but I suspect it's a question of context: regular old search is about whether a page is indexed or not. Either a URL is part of the index or it isn't. But with AI, there's important questions about what's in those urls.

I think Google is probably thinking hard about the problem of training AI: you don't want to train on the output of other AI. That doesn't mean the content shouldn't be processed, just that it shouldn't be used for training. Or maybe it's worth noting that some content is derived from other content that you've manually produced, versus content derived from the content of third parties.

Said another way, I expect that Google isn't just implementing a new allowlist/denylist. It's likely about exposing new information about content.

replies(1): >>2OEH8e+Li

>>bastaw+(OP)
Cool. Sounds like a you problem, you meaning crawlers and AI trainers. Now it will fall on every web developer to tag their data for it to be exploited by megacorps?

Now that I think of it- why do we put up with robots.txt at all?

> A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests

If someone overloads your site with automated requests how is that not criminal? Why aren't they liable?

replies(5): >>judge2+ij >>dasil0+Vl >>thephy+VL >>wickof+Dk1 >>bastaw+Ry6

>>2OEH8e+Li
You can block the crawler on your entire site. I’m not sure it’s true that it’s primarily used “to avoid overloading your site”.

replies(1): >>blacks+yu

>>2OEH8e+Li
I don't understand what you have against robots.txt. It's just a way to signal what you want crawlers to do on your site. It's not complicated or mandatory, but it gives you a way to influence how your site is accessed. I'm not sure why you would jump straight to litigation as a better solution—that solves a much smaller set of problems at a much higher cost.

>>judge2+ij
For sure, since those directives in your robots.txt don't actually compel the crawlers to do anything. They're more like a polite request, and plenty of bots ignore or 'accidentally' overstep them. I do think they have still some value, not just as a handy list of high-value targets - you may know that some part of your site has a bunch of similar links that it doesn't make sense to crawl or index (though there's always norel/nofollow...), or that some pages (/account/preferences etc.) just don't make sense for bots to be visiting. The general idea of extending the standard to cover training AI isn't a terrible idea, but it does seem like too little, too late.

replies(1): >>lakome+Ez1

>>2OEH8e+Li
> If someone overloads your site with automated requests how is that not criminal? Why aren't they liable?

Criminal requires a specific law in the criminal code be intentionally broken.

There is a world of difference between an intentional DoS and a crawler adding some marginal traffic to a server then backing off when the server responses fail.

>>2OEH8e+Li
Proposing to jail people for doing http requests to publicly available resources on a hacker forum?

>>blacks+yu
robots.txt tells the search engine which content is relevant. That's all.

>>2OEH8e+Li
> Now it will fall on every web developer to tag their data for it to be exploited by megacorps?

If Google says they'll delist your site if they detect AI generated content that you haven't declared, that's also a you problem (you meaning webmasters). It's a bit silly to think that it's a purely one way relationship. You're more than welcome to block Google from indexing your site (trivially!) and they're welcome to not include you in their service for not following their guidelines.