zlacker

Google to explore alternatives to robots.txt

submitted by skille+(OP) on 2023-07-08 05:30:12 | 116 points 110 comments

8. stonog+P4[view] [source] 2023-07-08 06:30:48
>>skille+(OP)
Given the method they decided on for people to opt out of wifi access point scanning -- requiring the rest of the world to change[1], while they continue doing whatever the hell they want -- I expect you'll need to log in to a Google account and write a brief essay about why your content shouldn't belong to them.

1 - https://support.google.com/maps/answer/1725632?hl=en#zippy=%...

9. varenc+f6[view] [source] [discussion] 2023-07-08 06:50:33
>>vore+X2
robots.txt is already outmoded. It can only indicate that content can’t be crawled, but a URL marked this way can still be indexed. As Google says, “it is not a mechanism for keeping a web page out of Google” [0]. You need to use something other than robots.txt to prevent indexing.

[0] https://developers.google.com/search/docs/crawling-indexing/...
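One of the "other things" is the `X-Robots-Tag: noindex` response header (a `<meta name="robots" content="noindex">` tag is the in-page equivalent). A minimal sketch, assuming a toy server built on Python's standard `http.server`; the handler name, helper functions, and page body are made up for illustration. The page must stay crawlable, since a crawler that is blocked by robots.txt never sees the header.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def noindex_headers(body: bytes):
    """Response headers for a page that may be crawled but should not be indexed."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Asks compliant search engines not to index this URL.
        ("X-Robots-Tag", "noindex"),
    ]


class NoIndexHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one page and marks it noindex."""

    def do_GET(self):
        body = b"<html><body>Crawlable, but please do not index.</body></html>"
        self.send_response(200)
        for name, value in noindex_headers(body):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real server would log requests.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), NoIndexHandler)
```

Calling `make_server(8000).serve_forever()` would serve the page locally; whether a given crawler honors the header is up to that crawler.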

19. berkle+da[view] [source] [discussion] 2023-07-08 07:41:02
>>dazc+g9
> bad actors

I prefer the term ‘Chad third-party scraper’ [1]

[1] https://pbs.twimg.com/media/FxkeJmjakAENFI8?format=jpg&name=...

22. Kwpols+ab[view] [source] 2023-07-08 07:51:20
>>skille+(OP)
Why would AI need a new standard for excluding it? Just add a "Googlebot-AI" user agent to your list [0] and respect these rules when crawling content for use in AIs, and convince OpenAI and Bing to do the same.

[0] https://developers.google.com/search/docs/crawling-indexing/...
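A rough sketch of how that could look from the crawler's side, using Python's standard `urllib.robotparser`. Note that "Googlebot-AI" is the hypothetical user agent proposed above, not a token Google has announced, and the robots.txt content and `allowed` helper are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the proposed AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-AI
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"
    print("Googlebot-AI:", allowed("Googlebot-AI", page))
    print("Googlebot:   ", allowed("Googlebot", page))
```

Here the AI agent is refused while the regular search crawler falls through to the `*` group. This only works if the crawler operator actually checks the rules under that name, which is the "convince OpenAI and Bing" part.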

30. masswe+Fc[view] [source] [discussion] 2023-07-08 08:07:48
>>konsch+ub
Mind that there are already countries regulating what may and may not appear in published photos. (E.g., the Eiffel Tower illuminated is protected: https://www.toureiffel.paris/en/business/use-image-of-eiffel...)

(Edit: How is a factual, on-topic statement, providing a source-link for its claim, downvoted? You may not favor these regulations, but they still do exist.)

66. tikkun+LR[view] [source] 2023-07-08 14:48:34
>>skille+(OP)
See also: https://content.getsphere.com/

75. pentag+s41[view] [source] 2023-07-08 16:10:28
>>skille+(OP)
The other day I was trying to find a YouTube comment that I remembered, but couldn't find it with the Google search "site:youtube.com [phrase of the comment]". I later found out that YouTube disallows search engines from indexing comments through robots.txt: https://www.youtube.com/robots.txt

>Disallow: /comment

So I guess that works for them.
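For anyone curious how a rule like that is evaluated, here is a small sketch with Python's standard `urllib.robotparser`. The two-line file is a reduced stand-in for the real robots.txt, and the example paths are made up. `Disallow` values are matched as path prefixes, so `/comment` also covers longer paths that start with it.

```python
from urllib.robotparser import RobotFileParser

# Reduced stand-in for the rule quoted above, not the full YouTube file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment
"""


def can_crawl(path: str, user_agent: str = "ExampleBot") -> bool:
    """Check a path on a placeholder host against the stand-in rules."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


if __name__ == "__main__":
    for path in ["/comment", "/comments/123", "/community", "/watch"]:
        print(path, can_crawl(path))
```

The first two paths are refused because they start with `/comment`; `/community` and `/watch` are not covered by the rule.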

95. westur+ZJ1[view] [source] 2023-07-08 20:19:02
>>skille+(OP)
There are a number of opportunities to solve for carbon.txt, security.txt, content licenses, indication of [AI] provenance, and do better than robots.txt; hopefully with JSON-LD Linked Data.

> >>35888037 : security.txt, carbon.txt, SPDX SBOM, OSV, JSON-LD, blockcerts

"Google will label fake images created with its A.I" (re: IPTC, Schema.org JSON-LD) (2023) >>35896000

From "Tell HN: We should start to add “ai.txt” as we do for “robots.txt”" (2023) >>35888037 :

> How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
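To make the question concrete, here is a minimal sketch of the kind of JSON-LD document being discussed: a schema.org `CreativeWork` carrying a `license` URL. The helper names, the example URLs, and the choice of license are assumptions for illustration only; nothing here is an agreed convention for AI opt-out.

```python
import json


def creative_work_jsonld(url: str, name: str, license_url: str) -> dict:
    """Build a schema.org CreativeWork description with a license link."""
    return {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "@id": url,
        "url": url,
        "name": name,
        "license": license_url,
    }


def as_script_tag(doc: dict) -> str:
    """Wrap the document the way JSON-LD is usually embedded in HTML."""
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


if __name__ == "__main__":
    doc = creative_work_jsonld(
        url="https://example.com/articles/1",  # placeholder resource URI
        name="Example article",
        license_url="https://creativecommons.org/licenses/by/4.0/",
    )
    print(as_script_tag(doc))
```

Any JSON parser can read the result, which is roughly the point of the quoted question: one generic parser plus shared vocabulary, instead of a new bespoke `.txt` format per concern.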
