1. dbette+(OP)[view] [source] 2023-07-08 06:04:58
I notice they don't actually give a good reason that robots.txt isn't suitable.

Change for the sake of it?

replies(6): >>stromb+5 >>Animat+7 >>vore+t >>helsin+11 >>h1fra+Ab1 >>JohnFe+EU6
2. stromb+5[view] [source] 2023-07-08 06:05:28
>>dbette+(OP)
AI!
replies(1): >>asudos+n1
3. Animat+7[view] [source] 2023-07-08 06:05:58
>>dbette+(OP)
It doesn't require signing up with Google.
replies(1): >>0x073+Q5
4. vore+t[view] [source] 2023-07-08 06:09:15
>>dbette+(OP)
To steelman this maybe, I think they’re angling for something like a mechanism to indicate content is OK to index but not OK to use as AI training data. Maybe you could fudge it today with user agents in robots.txt, but who knows what the concrete idea is.
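
A minimal sketch of the "fudge it with user agents" idea, using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the example.com URL are made up for illustration, not real identifiers. Note that the rules attach to a crawler's name, not to what the crawler does with the content, which may be the gap being pointed at.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"

# A search crawler falls through to the wildcard group and is allowed.
search_ok = parser.can_fetch("Googlebot", url)

# The (hypothetical) AI-training crawler matches its own group and is blocked.
training_ok = parser.can_fetch("ExampleAIBot", url)

print(search_ok, training_ok)
```

This only works if the training crawler announces itself under a separate user agent; a single crawler that feeds both search and training cannot be split this way.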
replies(2): >>varenc+L3 >>Aerroo+U9
5. helsin+11[view] [source] 2023-07-08 06:16:15
>>dbette+(OP)
> I notice they don't actually give a good reason that robots.txt isn't suitable

It's kind of implied: specifying sitemaps, allowances, and copyright terms for different use cases (search, scraping, republishing, training, etc.), and perhaps standardizing some of the non-standard extensions such as Crawl-delay and default host. Even Sitemap isn't part of the robots.txt standard.

> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.
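
As a rough illustration of the extensions mentioned in this comment, Python's standard `urllib.robotparser` already reads `Crawl-delay` and `Sitemap` lines even though neither is part of the core protocol. The robots.txt content and the agent name `AnyBot` below are invented for the example, and `site_maps()` assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt using two widely seen, non-standard extensions.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl-delay is looked up per user agent; "AnyBot" matches the wildcard group.
delay = parser.crawl_delay("AnyBot")

# Sitemap lines are global and not tied to any user-agent group.
sitemaps = parser.site_maps()

print(delay, sitemaps)
```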

◧◩
6. asudos+n1[view] [source] [discussion] 2023-07-08 06:19:05
>>stromb+5
Indeed, AI is called out directly. Google’s laying groundwork for their own version of a regulatory moat.
◧◩
7. varenc+L3[view] [source] [discussion] 2023-07-08 06:50:33
>>vore+t
robots.txt is already outmoded. It can only indicate that content can’t be crawled, but a URL marked this way can still be indexed. As Google says, “it is not a mechanism for keeping a web page out of Google” [0]. You need to use something other than robots.txt to prevent indexing.

[0] https://developers.google.com/search/docs/crawling-indexing/...
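
One of those other mechanisms is a `noindex` directive, sent either as a robots meta tag or as an `X-Robots-Tag` response header. Below is a minimal sketch of the header variant as a bare WSGI app; the app and the `fake_start_response` helper are illustrative names, not from any framework. A crawler has to be able to fetch the page to see the directive, so the URL must not also be disallowed in robots.txt.

```python
def app(environ, start_response):
    """Tiny WSGI app that serves a page and asks crawlers not to index it."""
    body = b"<html><body>Not for indexing</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # The directive itself: honored by crawlers that support it.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


# Exercise the app without a server by capturing what it passes back.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

chunks = app({}, fake_start_response)
print(captured["status"], captured["headers"]["X-Robots-Tag"])
```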

replies(1): >>dazc+v6
◧◩
8. 0x073+Q5[view] [source] [discussion] 2023-07-08 07:15:52
>>Animat+7
If it were public, the AI could read it and develop countermeasures ;)
◧◩◪
9. dazc+v6[view] [source] [discussion] 2023-07-08 07:25:16
>>varenc+L3
Indeed, having pages indexed which can't then be crawled is a great way of shooting yourself in the foot.
replies(1): >>floomk+h61
◧◩
10. Aerroo+U9[view] [source] [discussion] 2023-07-08 08:04:43
>>vore+t
This seems weird to me though, aren't search engines something very similar to AI, if not outright AI?
◧◩◪◨
11. floomk+h61[view] [source] [discussion] 2023-07-08 16:32:31
>>dazc+v6
I think you meant it's a great way for Google to punish you for not giving them full access
12. h1fra+Ab1[view] [source] 2023-07-08 17:04:29
>>dbette+(OP)
Came here to say that; seems like nobody has the answer :/

Maybe they want finer-grained control over page content, e.g. "you can index those pages but not those nodes" or "those nodes are also AI generated, please ignore".

Otherwise I don't know, robots.txt is not sexy but definitely does the job.

13. JohnFe+EU6[view] [source] 2023-07-10 16:05:53
>>dbette+(OP)
I think robots.txt isn't suitable for this for the same reason it's not suitable for keeping other bots from crawling your site: adhering to what robots.txt says is optional, and plenty of bots opt to ignore it.
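
A small sketch of why compliance is optional: the robots.txt check runs inside the crawler, so nothing stops a bot from skipping it. All names here (`polite_fetch`, `fake_fetch`, `PoliteBot`) are hypothetical, and `fake_fetch` stands in for a real HTTP request.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fake_fetch(url):
    # Stand-in for a real HTTP GET; the server has no say at this point.
    return f"contents of {url}"


def polite_fetch(url, agent, parser, fetch):
    """Fetch only when robots.txt allows it. The check is purely voluntary."""
    if not parser.can_fetch(agent, url):
        return None
    return fetch(url)


parser = make_parser("User-agent: *\nDisallow: /\n")
url = "https://example.com/page"

# A well-behaved crawler consults the rules and backs off.
polite_result = polite_fetch(url, "PoliteBot", parser, fake_fetch)

# A crawler that ignores robots.txt simply calls fetch directly.
rude_result = fake_fetch(url)

print(polite_result, rude_result)
```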