Google to explore alternatives to robots.txt

submitted by skille+(OP) on 2023-07-08 05:30:12 | 116 points 110 comments

>>skille+(OP)
Given the method they decided on for people to opt out of wifi access point scanning -- requiring the rest of the world to change[1], while they continue doing whatever the hell they want -- I expect you'll need to log in to a Google account and write a brief essay about why your content shouldn't belong to them.

1 - https://support.google.com/maps/answer/1725632?hl=en#zippy=%...

>>vore+X2
robots.txt is already outmoded. It can only indicate that content shouldn't be crawled, but a URL marked this way can still be indexed. As Google says, “it is not a mechanism for keeping a web page out of Google” [0]. You need to use something other than robots.txt to prevent indexing.

[0] https://developers.google.com/search/docs/crawling-indexing/...
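To make the crawl-vs-index distinction concrete, here is a minimal Python sketch. The robots.txt content, the example.com URLs, and both helper names are made up for illustration, and the header check is simplified (it ignores per-bot prefixes such as "googlebot: noindex"):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Answers only 'may this agent fetch the URL?', nothing about indexing."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def index_blocked(headers: dict) -> bool:
    """True if an X-Robots-Tag response header carries 'noindex' or 'none'."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = {d.strip().lower() for d in value.split(",")}
            if "noindex" in directives or "none" in directives:
                return True
    return False


if __name__ == "__main__":
    print(crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page"))
    print(index_blocked({"X-Robots-Tag": "noindex, nofollow"}))
```

The catch is that a crawler has to be allowed to fetch the page in order to see the noindex directive at all, so the two mechanisms answer different questions.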

>>dazc+g9
> bad actors

I prefer the term ‘Chad third-party scraper’ [1]

[1] https://pbs.twimg.com/media/FxkeJmjakAENFI8?format=jpg&name=...

>>skille+(OP)
Why would AI need a new standard for excluding it? Just add a "Googlebot-AI" user agent to your list [0] and respect these rules when crawling content for use in AIs, and convince OpenAI and Bing to do the same.

[0] https://developers.google.com/search/docs/crawling-indexing/...
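A rough sketch of how that could look, using Python's standard robots.txt parser. "Googlebot-AI" is the hypothetical token from the comment above, not an existing crawler name, and the rules and URL are invented for the example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out the (imagined) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-AI
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(agent: str, url: str) -> bool:
    """Check the rules above for a given user-agent token."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    for agent in ("Googlebot-AI", "Googlebot"):
        print(agent, may_fetch(agent, url))
```

Nothing in the file format has to change for this; it only works if each crawler operator actually checks its own token before using the content.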

>>konsch+ub
Mind that there are already countries regulating what may and may not appear in published photos. (E.g., the illuminated Eiffel Tower is protected: https://www.toureiffel.paris/en/business/use-image-of-eiffel...)

(Edit: How is a factual, on-topic statement, providing a source-link for its claim, downvoted? You may not favor these regulations, but they still do exist.)

>>skille+(OP)
See also: https://content.getsphere.com/

>>skille+(OP)
The other day I was trying to find a comment on a YouTube video that I remembered, but couldn't find it with the Google search "site:youtube.com [phrase of the comment]". I later found out that YouTube disallows search engines from indexing comments through robots.txt: https://www.youtube.com/robots.txt

>Disallow: /comment

So I guess that works for them.

>>skille+(OP)
There are a number of opportunities to solve for carbon.txt, security.txt, content licenses, indication of [AI] provenance, and do better than robots.txt; hopefully with JSON-LD Linked Data.

> >>35888037 : security.txt, carbon.txt, SPDX SBOM, OSV, JSON-LD, blockcerts

"Google will label fake images created with its A.I" (re: IPTC, Schema.org JSON-LD) (2023) >>35896000

From "Tell HN: We should start to add “ai.txt” as we do for “robots.txt”" (2023) >>35888037 :

> How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
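
For illustration, a minimal sketch of what such license metadata might look like as JSON-LD, built and read back with nothing but Python's json module. The page URL and both function names are made up; `CreativeWork` and `license` are the schema.org terms linked above, and a real consumer would need a proper JSON-LD processor to handle contexts, arrays, and nested nodes:

```python
import json
from typing import Optional


def creative_work_jsonld(url: str, license_url: str) -> str:
    """Serialize a bare-bones schema.org CreativeWork with a license."""
    doc = {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "@id": url,
        "url": url,
        "license": license_url,
    }
    return json.dumps(doc, indent=2)


def read_license(jsonld_text: str) -> Optional[str]:
    """Naive lookup of the top-level 'license' value; no context expansion."""
    return json.loads(jsonld_text).get("license")


if __name__ == "__main__":
    text = creative_work_jsonld(
        "https://example.com/post/1",  # hypothetical resource URI
        "https://creativecommons.org/licenses/by/4.0/",
    )
    print(text)
    print(read_license(text))
```

In this simple case one generic JSON parser covers it, which is the argument for putting license and provenance hints in Linked Data rather than inventing another ad hoc .txt format.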
