zlacker

Google to explore alternatives to robots.txt

submitted by skille+(OP) on 2023-07-08 05:30:12 | 116 points 110 comments
[view article] [source] [links] [go to bottom]
replies(25): >>dbette+u2 >>voytec+V3 >>stonog+P4 >>tannha+E6 >>dazc+g9 >>masswe+v9 >>blackl+X9 >>Kwpols+ab >>konsch+ub >>foreig+Bd >>xg15+2k >>sys_64+Xw >>revski+1K >>transf+TP >>tikkun+LR >>mrkram+TS >>pentag+s41 >>thayne+l81 >>denton+bo1 >>margin+fA1 >>westur+ZJ1 >>activi+vY1 >>lakome+0l2 >>mindcr+VN2 >>JohnFe+cW6
1. dbette+u2[view] [source] 2023-07-08 06:04:58
>>skille+(OP)
I notice they don't actually give a good reason that robots.txt isn't suitable.

Change for the sake of it?

replies(6): >>stromb+z2 >>Animat+B2 >>vore+X2 >>helsin+v3 >>h1fra+4e1 >>JohnFe+8X6
◧◩
2. stromb+z2[view] [source] [discussion] 2023-07-08 06:05:28
>>dbette+u2
AI!
replies(1): >>asudos+R3
◧◩
3. Animat+B2[view] [source] [discussion] 2023-07-08 06:05:58
>>dbette+u2
It doesn't require signing up with Google.
replies(1): >>0x073+k8
◧◩
4. vore+X2[view] [source] [discussion] 2023-07-08 06:09:15
>>dbette+u2
To steelman this maybe, I think they’re angling for something like a mechanism to indicate content is OK to index but not OK to use as AI training data. Maybe you could fudge it today with user agents in robots.txt but who knows what the concrete idea of this is.
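
A rough sketch of that fudge, assuming AI crawlers advertised and honored a dedicated user agent token (the name here is hypothetical):

  User-agent: AI-Training-Bot
  Disallow: /

  User-agent: *
  Allow: /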
replies(2): >>varenc+f6 >>Aerroo+oc
◧◩
5. helsin+v3[view] [source] [discussion] 2023-07-08 06:16:15
>>dbette+u2
> I notice they don't actually give a good reason that robots.txt isn't suitable

It's kind of implied: specifying sitemaps/allowances/copyright for different use cases (search, scraping, republishing, training, etc.), and perhaps adding some of the non-standard extensions: Crawl-delay, default host - even Sitemap isn't part of the robots.txt standard.

> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.
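
Concretely, those non-standard extensions look something like this (Crawl-delay and Host are honored only by some crawlers, and Sitemap comes from sitemaps.org rather than the robots.txt spec):

  User-agent: *
  Disallow: /private/
  Crawl-delay: 10

  Host: example.com
  Sitemap: https://example.com/sitemap.xml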

◧◩◪
6. asudos+R3[view] [source] [discussion] 2023-07-08 06:19:05
>>stromb+z2
Indeed, AI is called out directly. Google’s laying groundwork for their own version of a regulatory moat.
7. voytec+V3[view] [source] 2023-07-08 06:20:02
>>skille+(OP)
Seems like it's intended for content stealing from every place that doesn't immediately implement Google's New Web Order as an addition to robots.txt.

"Your do not enter sign uses font we don't like, so we'll just ignore it"

replies(3): >>saagar+T7 >>Ferret+fh >>LinuxB+UC
8. stonog+P4[view] [source] 2023-07-08 06:30:48
>>skille+(OP)
Given the method they decided on for people to opt out of wifi access point scanning -- requiring the rest of the world to change[1], while they continue doing whatever the hell they want -- I expect you'll need to log in to a Google account and write a brief essay about why your content shouldn't belong to them.

1 - https://support.google.com/maps/answer/1725632?hl=en#zippy=%...

replies(1): >>okeuro+s8
◧◩◪
9. varenc+f6[view] [source] [discussion] 2023-07-08 06:50:33
>>vore+X2
robots.txt is already outmoded. It can only indicate that content can't be crawled, but a URL marked this way can still be indexed. As Google says, “it is not a mechanism for keeping a web page out of Google” [0]. You need to use other things besides robots.txt to prevent indexing.

[0] https://developers.google.com/search/docs/crawling-indexing/...
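
For example, keeping a page out of the index (rather than just uncrawled) generally means serving a noindex signal that crawlers are still allowed to fetch:

  <!-- in the page's <head>; don't also block the URL in robots.txt,
       or crawlers never see this tag -->
  <meta name="robots" content="noindex">

  # equivalent HTTP response header, useful for non-HTML resources:
  X-Robots-Tag: noindex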

replies(1): >>dazc+Z8
10. tannha+E6[view] [source] 2023-07-08 06:55:56
>>skille+(OP)
Google also introduced XML sitemaps (and noindex), so technically I could see how robots.txt could be consolidated into sitemapindex.xml with additional attributes. But it's not clear why, and why now. Are there new requirements (from Google's PoV), such as tagging content as machine-generated? I'm not sure opening this discussion and changing the legal status of whether something may be crawled will end well for Google from a copyright perspective (requiring explicit and individual consent), and in particular by entering the legal terra incognita of moral rights vs. generative AI - but maybe blocking competitors/AI startups is what they're after?
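
Purely as a hypothetical sketch of what that consolidation could look like (none of these extension elements exist in the sitemaps schema today):

  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://example.com/sitemap-articles.xml</loc>
      <!-- made-up elements, for illustration only -->
      <crawl>allow</crawl>
      <ai-training>disallow</ai-training>
    </sitemap>
  </sitemapindex>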
◧◩
11. saagar+T7[view] [source] [discussion] 2023-07-08 07:10:36
>>voytec+V3
What makes you think this? Why do you think Google actually cares about your sign if all they want to do is steal from you?
replies(1): >>oneeye+la
◧◩◪
12. 0x073+k8[view] [source] [discussion] 2023-07-08 07:15:52
>>Animat+B2
If it were public, the AI could read it and develop countermeasures ;).
◧◩
13. okeuro+s8[view] [source] [discussion] 2023-07-08 07:18:22
>>stonog+P4
"To opt out, change the SSID (name) of your Wi-Fi access point (your wireless network name) so that it ends with "_nomap.""

Wow, that's absurd. It would have been better not to have any mechanism at all.

◧◩◪◨
14. dazc+Z8[view] [source] [discussion] 2023-07-08 07:25:16
>>varenc+f6
Indeed, having pages indexed which can't then be crawled is a great way of shooting yourself in the foot.
replies(1): >>floomk+L81
15. dazc+g9[view] [source] 2023-07-08 07:28:04
>>skille+(OP)
Since robots.txt is currently ignored by bad actors, why would any alternative be better?
replies(2): >>tjpnz+T9 >>berkle+da
16. masswe+v9[view] [source] 2023-07-08 07:30:55
>>skille+(OP)
May I suggest a more general "harvest.txt" for all purposes of scraping content?

Edit: Alternatively, have a "Harvest" section in "robots.txt", using the same established syntax and semantics. This may come with the advantage of making it clear that agents should default to the general "robots.txt" rules in absence of any such rules. Moreover, existing content management systems will already provide means for maintaining "robots.txt" and there's no need to update those. (We may also introduce an "Index" section for the established purpose of "robots.txt", with any bare, untitled rules defaulting to this, thus providing compatibility.)

Example:

  #file "robots.txt"

  Index # optional section heading (maybe useful for switching context)
  User-agent: *
  Allow: /
  Disallow: /test/
  Disallow: /private/
  
  User-agent: Badbot
  Disallow: /
  
  Harvest # additional rules for scraping
  User-agent: *
  Disallow: /blog/
  Disallow: /protected-artwork/
◧◩
17. tjpnz+T9[view] [source] [discussion] 2023-07-08 07:35:58
>>dazc+g9
Because Google can start ripping off content in the open with impunity.
18. blackl+X9[view] [source] 2023-07-08 07:36:27
>>skille+(OP)
Why are those folks trying to sprinkle AI over everything, even when it's completely inappropriate?

There's no AI involved in web crawling. If you come to my site, I'll tell you which pages you can visit/index, and which pages you can't, end of the story

Yes, there are security concerns with people putting /very-secret-admin-panel in their robots.txt and letting malicious actors know what URLs they should target. But if /very-secret-admin-panel is never linked by any public page, then the bot won't encounter it, therefore this stuff should never belong to robots.txt.

Please keep it as straightforward as this and don't add any AI bullshit to one of the few remaining simple processes in web development and administration.

replies(2): >>iamphi+Ca >>simion+me
◧◩
19. berkle+da[view] [source] [discussion] 2023-07-08 07:41:02
>>dazc+g9
> bad actors

I prefer the term ‘Chad third-party scraper’ [1]

[1] https://pbs.twimg.com/media/FxkeJmjakAENFI8?format=jpg&name=...

◧◩◪
20. oneeye+la[view] [source] [discussion] 2023-07-08 07:42:20
>>saagar+T7
IIRC, Google has precedent on this - e.g. scanning full books for search unless the owner explicitly refused.
replies(2): >>dylan6+9c >>411111+Cp
◧◩
21. iamphi+Ca[view] [source] [discussion] 2023-07-08 07:45:10
>>blackl+X9
Perhaps they’re intending to provide a means to say whether your content can be used within an AI training model or not.
replies(1): >>denton+tt1
22. Kwpols+ab[view] [source] 2023-07-08 07:51:20
>>skille+(OP)
Why would AI need a new standard for excluding it? Just add a "Googlebot-AI" user agent to your list [0] and respect these rules when crawling content for use in AIs, and convince OpenAI and Bing to do the same.

[0] https://developers.google.com/search/docs/crawling-indexing/...
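
Something like this, if crawlers actually honored the hypothetical "Googlebot-AI" token when gathering training data:

  User-agent: Googlebot-AI
  Disallow: /

  User-agent: Googlebot
  Allow: /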

replies(3): >>bastaw+CL >>lokar+MR >>JohnFe+jW6
23. konsch+ub[view] [source] 2023-07-08 07:55:34
>>skille+(OP)
If you run a store on Main Street, should you get to decide if people can take pictures of your store?

replies(9): >>inglor+Kb >>riffra+Rb >>manuel+tc >>masswe+Fc >>mkl95+Kc >>simonj+xd >>ehnto+ie >>gumbal+pe >>kmbfjr+kf1
◧◩
24. inglor+Kb[view] [source] [discussion] 2023-07-08 07:58:42
>>konsch+ub
If someone goes to your store on Main Street, takes your inventory list and the price list, does the same to all the competitors, and then sets up shop on Main Street selling people the information about where they should buy, there is a case for me to be able to decide whether or not I give them access to that information.

Especially since they're letting stores pay money to be the first recommended store.

replies(1): >>Aerroo+bc
◧◩
25. riffra+Rb[view] [source] [discussion] 2023-07-08 08:00:13
>>konsch+ub
If the shop owner is losing something when people take pictures of the shop, they may start to shoo them away every time.

Robots.txt exists because shop photographers want to be allowed to take pictures rather than be blocked tout court.

◧◩◪◨
26. dylan6+9c[view] [source] [discussion] 2023-07-08 08:02:30
>>oneeye+la
They are the ultimate ask-forgiveness-rather-than-permission company. Copyright had been a thing for a long, long time before googs developed their scanning. They were well aware that it should have been opt-in, but knew they'd never gain traction for their little project. So they bull-in-a-china-shop'd their way to a point where it was too late to stop them.
replies(3): >>philip+oK >>remus+FT >>extra8+r52
◧◩◪
27. Aerroo+bc[view] [source] [discussion] 2023-07-08 08:03:00
>>inglor+Kb
Isn't this what gas stations do?
◧◩◪
28. Aerroo+oc[view] [source] [discussion] 2023-07-08 08:04:43
>>vore+X2
This seems weird to me though, aren't search engines something very similar to AI, if not outright AI?
◧◩
29. manuel+tc[view] [source] [discussion] 2023-07-08 08:05:33
>>konsch+ub
If you run a store on Main Street, should people be allowed to take pictures of your store, copy its content and put it up for sale on another store?

I see this argument made over and over again here on HN and it’s puzzling that people always stop at the first part.

Companies won’t stop at the “look at your content” phase. They will use the knowledge gathered by looking at your content to do something else. That’s the problematic part.

replies(3): >>safety+dd >>konsch+od >>chii+Mj
◧◩
30. masswe+Fc[view] [source] [discussion] 2023-07-08 08:07:48
>>konsch+ub
Mind that there are already countries regulating what may be in published photos and what may be not. (E.g., the Eiffel Tower illuminated is protected: https://www.toureiffel.paris/en/business/use-image-of-eiffel...)

(Edit: How is a factual, on-topic statement, providing a source-link for its claim, downvoted? You may not favor these regulations, but they still do exist.)

◧◩
31. mkl95+Kc[view] [source] [discussion] 2023-07-08 08:08:36
>>konsch+ub
You will get a lot of replies focused on scraping. Which is exactly what Google do for a living. It's in their best interest to be the only company obtaining that valuable data.
◧◩◪
32. safety+dd[view] [source] [discussion] 2023-07-08 08:14:46
>>manuel+tc
...Yes?

Retail companies research what other retail companies are doing and copy them all the time... was the answer supposed to be no here?

replies(3): >>nxpnsv+nd >>rat998+td >>manuel+ef
◧◩◪◨
33. nxpnsv+nd[view] [source] [discussion] 2023-07-08 08:17:07
>>safety+dd
Often they have signs forbidding you to take photos in stores… i guess that’s a bit like robots.txt
replies(1): >>delfin+Pl1
◧◩◪
34. konsch+od[view] [source] [discussion] 2023-07-08 08:17:14
>>manuel+tc
I don't think that's problematic. That's how societies work. They learn.
replies(2): >>gumbal+Be >>manuel+of
◧◩◪◨
35. rat998+td[view] [source] [discussion] 2023-07-08 08:18:25
>>safety+dd
It is a no depending on what you sell. If they sell original pictures, you cannot copy them. You are allowed to sell the same products, but not to copy them.
replies(1): >>safety+if
◧◩
36. simonj+xd[view] [source] [discussion] 2023-07-08 08:19:06
>>konsch+ub
It's an interesting "have your cake and eat it" debate. Where is the line? Is there a line? How should you decide, and how should anyone else decide?

I find this debate very aligned to copyright debates.

37. foreig+Bd[view] [source] 2023-07-08 08:19:19
>>skille+(OP)
WARNING: training data moat building in progress.

They want to introduce a line in robots.txt that says "not for training AI", so nobody else can use public data to train their AI. They already did.

replies(1): >>isodev+8e
◧◩
38. isodev+8e[view] [source] [discussion] 2023-07-08 08:24:41
>>foreig+Bd
An ad company is keen to explore alternative means of controlling what other ad companies can do... after said ad company has already scraped what they needed, with no regard for whether it was even OK to do so.
◧◩
39. ehnto+ie[view] [source] [discussion] 2023-07-08 08:27:04
>>konsch+ub
This analogy doesn't map very well. It's a clearly different medium.

The value of a store is the ability to buy products from it; you taking a photo of it doesn't impact that transaction of value at all. The value of content online is the very act of reading/consuming it.

A scraper is getting a free lunch, that much is clear. They are trading nothing for something, and as the owner of the something, that is not the deal I have chosen to make.

◧◩
40. simion+me[view] [source] [discussion] 2023-07-08 08:27:13
>>blackl+X9
Maybe some websites would like to specify something more than "I allow everything". Maybe you could specify a license for the data on the page - for example, you're OK with it being used in open-source research or open-source AI training, but don't allow the data to be used in proprietary AI; or you don't want any kind of AI/research use of the data at all, only search indexing.
◧◩
41. gumbal+pe[view] [source] [discussion] 2023-07-08 08:28:02
>>konsch+ub
It’s more like people stealing your products because they're out on the main street, rather than taking pictures. IP theft has real-life implications. Even taking pictures does, if they're taken by thieves planning a crime.
◧◩◪◨
42. gumbal+Be[view] [source] [discussion] 2023-07-08 08:30:53
>>konsch+od
AI is not “societies” or “people”, and it most certainly doesn't “learn” as the two would. Perhaps that's what OpenAI's effective marketing campaign taught gullible folks, but that's not how it works at all. A”I” ingests massive amounts of people’s intellectual work, often without consent, mixes it, and resells it without royalties.
replies(1): >>chii+Wj
◧◩◪◨
43. manuel+ef[view] [source] [discussion] 2023-07-08 08:39:11
>>safety+dd
And is your point that that’s ok?
replies(1): >>safety+Bg
◧◩◪◨⬒
44. safety+if[view] [source] [discussion] 2023-07-08 08:39:29
>>rat998+td
You can take a photo of someone else's copyrighted picture (photo, art, whatever). Or any other merchandise they're selling. You can even do it while you're on their property, standing next to a sign that says no photos allowed. All legal.

The business has the right to ask you to leave if you violate their policies. In fact, they can ask you to leave for (almost) any reason at all. They may have some limited right to remove you using a reasonable amount of force, depending on the jurisdiction.

Once you've left or been removed from their property, you still have the legal right to take photos of it from the public place you're now standing in. If you can view the photos or art they're selling through their window, you can keep taking photos of them.

They don't have the right to confiscate your camera or the pictures you took. Your rights in terms of what you can do with those photos may have limitations (e.g. redistribution, reproduction), particularly if you photographed copyrighted works.

This is why the parent's comment confused me so much. In most of the world you live in a society where yeah you have the freedom to take photos of stuff, or copy it down on a clipboard or whatever, and use it as competitive intelligence to improve your own business. And thousands of businesses are doing it every day.

replies(1): >>manuel+qg
◧◩◪◨
45. manuel+of[view] [source] [discussion] 2023-07-08 08:41:01
>>konsch+od
“How societies work” can be used to justify essentially everything and I do not think it’s a good argument.
◧◩◪◨⬒⬓
46. manuel+qg[view] [source] [discussion] 2023-07-08 08:54:36
>>safety+if
Everything you wrote ignores the fact that the content taken from websites is not just parked there to be used as “competitive intelligence”.

It becomes an integral part of a business product. That is the problematic part.

You going into a store and taking pictures of some art to use as reference material is not an issue.

But if you take those pictures and use them to make a program that then spits out new art that is just a mix of those images patched together, then, imo, that's an issue.

replies(1): >>safety+ti
◧◩◪◨⬒
47. safety+Bg[view] [source] [discussion] 2023-07-08 08:55:59
>>manuel+ef
Maybe I am not understanding your point?

Of course it's OK to take note of what stock is on a store's shelf, go back to your own business, and sell the same stock. It's also ubiquitous. It is de facto practiced globally by everyone, it's generally legal, and it's morally fine. Broadly speaking we call this competitive intelligence or market intelligence.

replies(1): >>manuel+Xh
◧◩
48. Ferret+fh[view] [source] [discussion] 2023-07-08 09:04:59
>>voytec+V3
To be clear, robots.txt is not legally binding, Google is not bound to follow it, and in fact I believe that Google doesn't follow it and hasn't for a very long time, for the simple reason that many sites' robots.txt file is wrong.

The intent of robots.txt is to help crawlers, for example, to keep crawlers from getting stuck in a recursive loop of dynamic pages, or from crawling pages with no value. robots.txt is not for banning, restricting, or hindering crawlers.

replies(3): >>superk+uS >>lisasa+G31 >>floomk+M71
◧◩◪◨⬒⬓
49. manuel+Xh[view] [source] [discussion] 2023-07-08 09:14:51
>>safety+Bg
My point is that these analogies fail to capture the actual reality of AI products and their relationship with source content.

The source content is part of the AI product. There is no AI product without the source content.

This is not you going to a store, seeing what they sell, and adjusting your offering. You have no offering without the original store’s content.

◧◩◪◨⬒⬓⬔
50. safety+ti[view] [source] [discussion] 2023-07-08 09:20:04
>>manuel+qg
It sounds to me like we agree. With respect, people have a lot more rights than they realize when it comes to taking photos of stuff in public (or semi-public) places, which is the scenario in your analogy. But this has questionable bearing on whether an AI can scoop up Internet content and do something with it.

I think it's almost a guarantee that courts will start finding exact AI reproductions of copyrighted work to be infringement.

Where the analogy might come into play is that if you take a photo of a copyrighted work there are limitations on what you can do with your photo, without infringing on that copyright. I have no idea if the courts will apply that stuff to AI, for instance there's actually a fair bit of leeway if you take a photo which contains only a portion of a copyrighted work and then you want to sell or redistribute that photo. One might argue that this legal principle applies to AI as well... lawyers are already having a field day with this stuff I'm sure.

replies(1): >>Spivak+bQ
◧◩◪
51. chii+Mj[view] [source] [discussion] 2023-07-08 09:35:06
>>manuel+tc
> copy its content and put it up for sale on another store?

they aren't copying the content. They are learning off the content, and producing more like it, but not a copy.

replies(1): >>denton+Jr1
◧◩◪◨⬒
52. chii+Wj[view] [source] [discussion] 2023-07-08 09:36:57
>>gumbal+Be
> ingests massive amounts of people’s intellectual work, often without consent, mixes it and resells it without royalties.

but when people do that, it is allowed, isn't it? So what is special about AI, other than the scale?

replies(1): >>gumbal+Ik
53. xg15+2k[view] [source] 2023-07-08 09:38:07
>>skille+(OP)
This press release really contains no substantial information (except the signup form), but the amount of "corpospeak words" in there that are usually euphemisms for bad news frankly worries me.

In particular those bits:

> A principled approach to evolving choice and control for web content

> We believe everyone benefits from a vibrant content ecosystem. Key to that is web publishers having choice and control over their content, and opportunities to derive value from participating in the web ecosystem. However, we recognize that existing web publisher controls were developed before new AI and research use cases.

> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.

That's an awful lot of talk about "choice", and even more so "evolving choice". That's particularly odd when the choice of most publishers seems to be rather clear: "don't scrape content for AI training without at least asking first" - and robots.txt is perfectly capable of expressing that choice.

So the ones that seem unhappy with the available means of choice seem to be the AI scrapers, not the publishers.

So my preliminary translation of this from corpospeak would be:

"Look guys, we were fine with robots.txt as long as we were only scraping your sites for search indexing.

But now the AI race is on and gathering training data has just become Too Important, so we're planning to ignore robots.txt in the near future and just scrape the entirety of your sites for AI training.

Instead, we'll offer you the choice of whether you want to let us scrape in exchange for some yet-to-be-determined compensation or whether you just provide the data for free. If we're particularly nice, we'll also give you an option to opt-out of scraping altogether. However, this option will be separate from robots.txt and you will have to explicitly add it to your site (provided you get to know about it in the first place)"

That being said, I find robots.txt a bit strange for a target for this. Robots.txt really is nothing - it's not a license and has no legal significance (afaik) and it never prevented scraping on a technical level either. All it did was give friendly scrapers a hint, so they don't accidentally step on the publisher's toes - but it never prevented anyone from intentionally scraping stuff they weren't supposed to.

On the other hand, if some courts did interpret robots.txt as some kind of impromptu licence, that interpretation probably wouldn't change, whether Google likes the standard or not. Also, people who employ real technical measures (ratelimiting, captchas, etc.) will probably continue to do so.

So if that's what they're planning to do, my only explanation would be that there is a large number of small, "low-hanging fruit" sites (probably with inexperienced devs) that don't want to be scraped but really only added a robots.txt to block scrapers and didn't do anything else - and Google is planning to use those for AI training now that all the large social networks are increasingly blocking them off.

◧◩◪◨⬒⬓
54. gumbal+Ik[view] [source] [discussion] 2023-07-08 09:46:38
>>chii+Wj
This debate is becoming tiring - yes, humans are allowed to, according to terms and conditions. We could use the same argument to claim that a database is just human memory at scale, and thus should be allowed to store any data it wants and then serve it, yet we don't permit that. Similarly, a laptop can sing because, just like a human, it emits sound, yet you have to pay for what it emits.

AI is software. It doesn't “learn” as a human does, and even if it did, it would still have to be bound by the same rules as any other piece of software and human.

replies(1): >>chii+1o
◧◩◪◨⬒⬓⬔
55. chii+1o[view] [source] [discussion] 2023-07-08 10:32:26
>>gumbal+Ik
> it would still have to be bound by the same rules as any other piece of software and human.

exactly, so there's zero reason to prevent anyone from using a piece of software (which slurps a lot of information off the internet) to produce new works that don't infringe on currently copyrighted content.

replies(1): >>gumbal+GL
◧◩◪◨
56. 411111+Cp[view] [source] [discussion] 2023-07-08 10:48:44
>>oneeye+la
Your phrasing makes it sound like that's a negative.

I'm honestly surprised they're required to abstain from doing so at the author's request.

You can only read the context of the match after finding the search result after all, not the whole book.

It's an example of significant overreach of intellectual property, from how I see it. The robots.txt rationale doesn't apply there either, as their scanning does not impact anyone's resources. And it's been published, which makes it public by definition.

replies(1): >>oneeye+Ox
57. sys_64+Xw[view] [source] 2023-07-08 12:15:46
>>skille+(OP)
How about they invent a new method that is opt-out by default, where lack of presence means move along and leave this server alone.
◧◩◪◨⬒
58. oneeye+Ox[view] [source] [discussion] 2023-07-08 12:24:28
>>411111+Cp
Oh, I agree with you. I think the whole idea of legislating against machines accessing public content is a very slippery slope.
◧◩
59. LinuxB+UC[view] [source] [discussion] 2023-07-08 13:08:51
>>voytec+V3
AFAIK the only way to reduce content stealing by bots is to add authentication requirements to a page, detect when a real person's authentication is being shared by bots, and then instantly and automatically rotate their password each time that occurs.
60. revski+1K[view] [source] 2023-07-08 13:58:58
>>skille+(OP)
Signing up to get more updates? No, sorry.
◧◩◪◨⬒
61. philip+oK[view] [source] [discussion] 2023-07-08 14:00:50
>>dylan6+9c
They don't even ask for forgiveness. They are "don't admit you've done anything wrong to begin with."
◧◩
62. bastaw+CL[view] [source] [discussion] 2023-07-08 14:07:42
>>Kwpols+ab
I have no insight, but I suspect it's a question of context: regular old search is about whether a page is indexed or not. Either a URL is part of the index or it isn't. But with AI, there are important questions about what's in those URLs.

I think Google is probably thinking hard about the problem of training AI: you don't want to train on the output of other AI. That doesn't mean the content shouldn't be processed, just that it shouldn't be used for training. Or maybe it's worth noting that some content is derived from other content that you've manually produced, versus content derived from the content of third parties.

Said another way, I expect that Google isn't just implementing a new allowlist/denylist. It's likely about exposing new information about content.

replies(1): >>2OEH8e+n41
◧◩◪◨⬒⬓⬔⧯
63. gumbal+GL[view] [source] [discussion] 2023-07-08 14:08:08
>>chii+1o
Well, that goes without saying. The issue is not the tool; the issue is how it's created and used. No problem in using publicly available, AI-friendly licensed content. The issue is using copyrighted content without consent and without honouring licensing terms.
replies(1): >>chii+gp2
64. transf+TP[view] [source] 2023-07-08 14:36:59
>>skille+(OP)
There are a lot of comments about scraping, but I think their new standard will try to tag AI-generated content.

There is not much point in giving crawlers a lot of generated content; rather, only the succinct "prompt".

That way, it would be easy to signal to crawlers what to crawl, and the user can read the content after the LLM has done its work...

◧◩◪◨⬒⬓⬔⧯
65. Spivak+bQ[view] [source] [discussion] 2023-07-08 14:38:42
>>safety+ti
> I think it's almost a guarantee that courts will start finding exact AI reproductions of copyrighted work to be infringement.

That was never not true. The difference is that AI can't violate copyright, only humans can. The legal not-so-gray area is whether "spat out by an AI after prompting" is a performance of the work and if so, what human is responsible for the copying.

replies(1): >>Anthon+Q61
66. tikkun+LR[view] [source] 2023-07-08 14:48:34
>>skille+(OP)
See also: https://content.getsphere.com/
◧◩
67. lokar+MR[view] [source] [discussion] 2023-07-08 14:48:38
>>Kwpols+ab
Google just copies everything and saves it. They then use it for various purposes. It would be strange to fetch pages several times.
replies(1): >>margin+aZ
◧◩◪
68. superk+uS[view] [source] [discussion] 2023-07-08 14:55:04
>>Ferret+fh
That's just because Google is a corporate person who is more equal than a human person. Human persons, at least in the USA, get charged under the CFAA (18 U.S.C. 1030) if they're using non-browser tools to access the public website of someone with power, and/or if they happen to rock the boat (like weev w/wget).

That's not to say that I disagree. In most cases robots.txt is not legally binding. It only becomes a legal danger to not follow it when the person running the site has power and can buy a DA to indict you.

replies(2): >>rafark+Ln1 >>TeMPOr+D02
69. mrkram+TS[view] [source] 2023-07-08 14:58:36
>>skille+(OP)
In the first sentence they mention AI, that's how I know it's doomed.
replies(1): >>Mental+A01
◧◩◪◨⬒
70. remus+FT[view] [source] [discussion] 2023-07-08 15:02:55
>>dylan6+9c
Copyright is to do with protecting reproduction of works, no? What Google has done here is scan the book and index the content, presumably to make it easier for users to search books for relevant material. Assuming they don't reproduce large sections of copyrighted works in their search results, I don't feel like they're doing anything wrong here.
replies(1): >>tpxl+Oh1
◧◩◪
71. margin+aZ[view] [source] [discussion] 2023-07-08 15:37:20
>>lokar+MR
They don't really need to fetch it twice though? Fetch 0-1 times, use it according to what robots.txt allows.
◧◩
72. Mental+A01[view] [source] [discussion] 2023-07-08 15:47:01
>>mrkram+TS
Because something annoys you, it's destined to fail? I'm not following the logic.
◧◩◪
73. lisasa+G31[view] [source] [discussion] 2023-07-08 16:05:45
>>Ferret+fh
> for the simple reason that many sites' robots.txt file is wrong.

Which is of course not the real reason.

The reason Google doesn't follow the robots.txt protocol is (1) they don't want to (2) they can get away with it.

◧◩◪
74. 2OEH8e+n41[view] [source] [discussion] 2023-07-08 16:10:00
>>bastaw+CL
Cool. Sounds like a you problem, you meaning crawlers and AI trainers. Now it will fall on every web developer to tag their data for it to be exploited by megacorps?

Now that I think of it- why do we put up with robots.txt at all?

> A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests

If someone overloads your site with automated requests how is that not criminal? Why aren't they liable?

replies(5): >>judge2+U41 >>dasil0+x71 >>thephy+xx1 >>wickof+f62 >>bastaw+tk7
75. pentag+s41[view] [source] 2023-07-08 16:10:28
>>skille+(OP)
The other day I was trying to find a comment on a YouTube video that I remembered, but I wasn't able to find it with the Google search "site:youtube.com [phrase of the comment]", only to discover later that YouTube disallows search engines from indexing comments through robots.txt: https://www.youtube.com/robots.txt

>Disallow: /comment

So I guess that works for them.

replies(1): >>Distra+Qd1
◧◩◪◨
76. judge2+U41[view] [source] [discussion] 2023-07-08 16:12:20
>>2OEH8e+n41
You can block the crawler on your entire site. I’m not sure it’s true that it’s primarily used “to avoid overloading your site”.
replies(1): >>blacks+ag1
◧◩◪◨⬒⬓⬔⧯▣
77. Anthon+Q61[view] [source] [discussion] 2023-07-08 16:22:35
>>Spivak+bQ
Except that they almost never do exact reproductions of a work. If you were trying to do it on purpose you'd have to do some significant prompt engineering to get it to even come close. Because the nature of it is to smush together thousands of different things, not photocopy one in particular.

The exceptions will be like, pictures of a specific city's skyline. Not because it's copying a particular image, but because that's what that city's skyline looks like, so that's how it looks in an arbitrary picture of it. But those are the pictures that lack original creativity to begin with -- which is why the pictures in the training data are all the same and so is the output.

And people seem to make a lot of the fact that it will often reproduce watermarks, but the reason it does that isn't that it's copying a specific image. It's that there are a large number of images of that subject with that watermark. So even though it's not copying any of them in particular, it's been trained that pictures of that subject tend to have that watermark.

Obviously lawyers are going to have a field day with this, because this is at the center of an existing problem with copyright law. The traditional way you show copying is similarity (and access). Which no longer really means anything because you now have databases of billions of works, which are public (so everyone has access), and computers that can efficiently process them all to find the existing work which is most similar to any new one. And if you put those two works next to each other they're going to look similar to a human because it's the 99.9999999th percentile nearest match from a database of a billion images, regardless of whether the new one was actually generated from the existing one. It's the same reason YouTube Content ID has false positives -- except that its database only includes major Hollywood productions. A large image database would have orders of magnitude more.

◧◩◪◨
78. dasil0+x71[view] [source] [discussion] 2023-07-08 16:26:57
>>2OEH8e+n41
I don't understand what you have against robots.txt. It's just a way to signal what you want crawlers to do on your site. It's not complicated or mandatory, but it gives you a way to influence how your site is accessed. I'm not sure why you would jump straight to litigation as a better solution—that solves a much smaller set of problems at a much higher cost.
◧◩◪
79. floomk+M71[view] [source] [discussion] 2023-07-08 16:28:54
>>Ferret+fh
They are in the EU. If something was not meant to be accessible you may not scrape it.
80. thayne+l81[view] [source] 2023-07-08 16:30:49
>>skille+(OP)
Hopefully the result is actually a portable standard, and not filling out forms for google and every other company training AI.
◧◩◪◨⬒
81. floomk+L81[view] [source] [discussion] 2023-07-08 16:32:31
>>dazc+Z8
I think you meant it's a great way for Google to punish you for not giving them full access.
◧◩
82. Distra+Qd1[view] [source] [discussion] 2023-07-08 17:02:33
>>pentag+s41
In fairness, I'd expect YouTube comments to largely be noise and not worth indexing.
◧◩
83. h1fra+4e1[view] [source] [discussion] 2023-07-08 17:04:29
>>dbette+u2
Came here to say that; seems like nobody has the answer :/

Maybe they want to have finer-grained details on page content, e.g.: "you can index those pages but not those nodes" or "those nodes are also AI generated, please ignore".

Otherwise I don't know, robots.txt is not sexy but definitely does the job.

◧◩
84. kmbfjr+kf1[view] [source] [discussion] 2023-07-08 17:13:25
>>konsch+ub
That isn’t what they are doing. The goal here is to use LLMs so that the end user never has to leave Google.

Gone will be revenue sharing, gone will be users visiting other sites.

The goal is for Google to keep ALL the revenue, for content written by others.

Hope that works out for them. I have already taken down over 300 articles I wrote on networking, Linux, FreeBSD, WireGuard, DSP, and software-defined radios. I am not feeding a machine that steals my writing, regardless of the fact that I never explicitly expected payment from the viewer.

replies(1): >>rafark+op1
◧◩◪◨⬒
85. blacks+ag1[view] [source] [discussion] 2023-07-08 17:18:40
>>judge2+U41
For sure, since those directives in your robots.txt don't actually compel the crawlers to do anything. They're more like a polite request, and plenty of bots ignore or 'accidentally' overstep them. I do think they still have some value, and not just as a handy list of high-value targets - you may know that some part of your site has a bunch of similar links that it doesn't make sense to crawl or index (though there's always nofollow...), or that some pages (/account/preferences etc.) just don't make sense for bots to visit. The general idea of extending the standard to cover AI training isn't terrible, but it does seem like too little, too late.
replies(1): >>lakome+gl2
◧◩◪◨⬒⬓
86. tpxl+Oh1[view] [source] [discussion] 2023-07-08 17:27:00
>>remus+FT
> Assuming they don't reproduce large sections of copyrighted works

They do (or did). They showed the text around the search term, around a page or so, which made it possible to reconstruct the whole book without that much effort.

◧◩◪◨⬒
87. delfin+Pl1[view] [source] [discussion] 2023-07-08 17:48:29
>>nxpnsv+nd
Yes, inside the store which is private property, they can legally start enacting such restrictions. Outside the store, not so much
◧◩◪◨
88. rafark+Ln1[view] [source] [discussion] 2023-07-08 17:57:26
>>superk+uS
If a tool can access a url, does that not make it a browser?
replies(1): >>TeMPOr+WZ1
89. denton+bo1[view] [source] 2023-07-08 17:59:32
>>skille+(OP)
> You can join the web and AI communities’ discussion by signing up on our website and we'll share more information about this process soon.

> I accept Google's Terms and Conditions and acknowledge that my information will be used in accordance with Google's Privacy Policy.

Why is this being done through a Google mailing list? Why does Google want any public participation anyway? They usually just implement their new gee-whizz scheme, and start strong-arming web publishers into using it.

Like, why would I trust a process that is run by Google, to create a new mechanism for controlling search engine behaviour? Fox: meet henhouse.

◧◩◪
90. rafark+op1[view] [source] [discussion] 2023-07-08 18:06:15
>>kmbfjr+kf1
I’m not entirely sure that’s a bad thing for the user, though. A few years ago, you could click on pretty much any blog post and you knew you were getting high-quality, or at least relevant, information related to your search query.

Nowadays most blog posts in the SERPs are full of spam and unnecessary filler text. I stopped clicking on random blogs because of how awful they’ve become. I’m currently using Bing Chat (which uses ChatGPT-4 under the hood) and it saves me a lot of time.

◧◩◪◨
91. denton+Jr1[view] [source] [discussion] 2023-07-08 18:17:15
>>chii+Mj
If you record 10 billion parameters from a 3-megapixel image, it's kinda disingenuous to pretend you haven't copied the image.
◧◩◪
92. denton+tt1[view] [source] [discussion] 2023-07-08 18:26:05
>>iamphi+Ca
Why would any webmaster allow any of their content be used to train an AI? What's in it for them?

The deal with searchbots is that you allow indexing because you want to be found. But no such quid-pro-quo occurs when the content is just fed into the maw of an AI trainer.

◧◩◪◨
93. thephy+xx1[view] [source] [discussion] 2023-07-08 18:52:13
>>2OEH8e+n41
> If someone overloads your site with automated requests how is that not criminal? Why aren't they liable?

Criminal requires a specific law in the criminal code be intentionally broken.

There is a world of difference between an intentional DoS and a crawler adding some marginal traffic to a server then backing off when the server responses fail.

94. margin+fA1[view] [source] 2023-07-08 19:10:21
>>skille+(OP)
There are problems with robots.txt if you actually try to implement it for a crawler. Consider this scenario:

  Allow: /foo
  Disallow: /bar
Consider the situation where /foo HTTP 301s to /bar, or 200s but with a canonical location header that is /bar. Do you follow the redirect? Do you index /foo?
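
A minimal sketch of one conservative policy (a judgment call, not something the spec answers), using Python's stdlib parser and requiring both the requested URL and the post-redirect URL to be allowed:

  import urllib.request
  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
  rp.read()

  def should_index(url, agent="MyCrawler"):
      # only fetch at all if the requested URL is allowed...
      if not rp.can_fetch(agent, url):
          return False
      # ...and only index if the URL we actually land on is allowed too
      final_url = urllib.request.urlopen(url).geturl()  # follows the 301
      return rp.can_fetch(agent, final_url)

(A canonical link header would need the same check against its target.)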

In practice it's also often a directory of the paths the website owners don't want eyes to look at. Pretty common to find a list of uncomfortable content, especially on larger websites... like that time the dean of the college praised the philanthropy of Boko Haram. Real OSINT footgun.

replies(1): >>Animat+eF2
95. westur+ZJ1[view] [source] 2023-07-08 20:19:02
>>skille+(OP)
There are a number of opportunities to solve for carbon.txt, security.txt, content licenses, indication of [AI] provenance, and do better than robots.txt; hopefully with JSON-LD Linked Data.

> >>35888037 : security.txt, carbon.txt, SPDX SBOM, OSV, JSON-LD, blockcerts

"Google will label fake images created with its A.I" (re: IPTC, Schema org JSON-LD" (2023) >>35896000

From "Tell HN: We should start to add “ai.txt” as we do for “robots.txt”" (2023) >>35888037 :

> How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
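
For instance, a minimal JSON-LD block declaring a license for a page might look like this (illustrative only; nothing here is an AI-specific control yet):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "url": "https://example.com/article",
    "license": "https://creativecommons.org/licenses/by-nc/4.0/"
  }
  </script>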

96. activi+vY1[view] [source] 2023-07-08 21:58:39
>>skille+(OP)
"Google to ignore robots.txt for AI purposes"
◧◩◪◨⬒
97. TeMPOr+WZ1[view] [source] [discussion] 2023-07-08 22:10:36
>>rafark+Ln1
Not under any but the most narrow of meanings, i.e. "can follow URLs / can talk HTTP". By itself, it's not a browser to users, it's not a browser to software developers, and it's definitely not a browser to lawyers and judges.
replies(1): >>rafark+c92
◧◩◪◨
98. TeMPOr+D02[view] [source] [discussion] 2023-07-08 22:16:57
>>superk+uS
> like weev w/wget

Speaking of this and other cases of trying to punish someone for every iteration of a for loop - I wonder if the result would be the same if the accused drove an actual browser to click stuff in a for loop, vs. using curl directly. I imagine the same, but then...

... what if they paid N people some token amount of money to have each of those people do one step of the loop and send them the result? Does executing a for loop entirely or in part on the human substrate, instead of in silico, count as abuse under the CFAA?

(I have a feeling that it might not be - there's lots of jobs online and offline that involve one company paying lots of people some money for gathering information from their competitors, in a way the latter very much don't like.)

◧◩◪◨⬒
99. extra8+r52[view] [source] [discussion] 2023-07-08 23:00:28
>>dylan6+9c
Yet they keep getting sued and keep winning in the courts, at least in the U.S. Seems like they have a pretty good grasp of how the laws work.
◧◩◪◨
100. wickof+f62[view] [source] [discussion] 2023-07-08 23:10:24
>>2OEH8e+n41
Proposing to jail people for making HTTP requests to publicly available resources, on a hacker forum?
◧◩◪◨⬒⬓
101. rafark+c92[view] [source] [discussion] 2023-07-08 23:40:43
>>TeMPOr+WZ1
Is there a legal definition of a web browser though? I think it’s an interesting topic.
102. lakome+0l2[view] [source] 2023-07-09 01:33:12
>>skille+(OP)
Please stop adding more and more irrelevant text files Google.

It's getting annoying.

There's nothing wrong with robots.txt. Don't change what works just because you Google developers have to justify your employment.

◧◩◪◨⬒⬓
103. lakome+gl2[view] [source] [discussion] 2023-07-09 01:35:22
>>blacks+ag1
robots.txt tells the search engine which content is relevant. That's all.
◧◩◪◨⬒⬓⬔⧯▣
104. chii+gp2[view] [source] [discussion] 2023-07-09 02:16:33
>>gumbal+GL
> ai friendly licensed content

> The issue is using copyrighted content without consent

the consent is given implicitly if the content is available to the public for viewing. The copyright isn't being violated by an AI training model, as the work isn't copied. The information contained within the works is not what's copyrighted - it's the expression.

If the AI training algorithm is capable of extracting the information out of the works and using it in another environment as part of some other works, you cannot claim copyright over such information.

This applies to style, patterns and other abstract information that could be extracted from works. It's as if a chef, upon reading many recipe books, produces a new recipe book (that contains information extracted from them) - the original creators of those recipe books cannot claim said chef had violated any copyright.

◧◩
105. Animat+eF2[view] [source] [discussion] 2023-07-09 05:34:31
>>margin+fA1
> There are problems with robots.txt if you actually try to implement it for a crawler.

Yes, although that's not what people are usually worried about.

I once tried to deal with that in Sitetruth's crawler. There are redirects at the HTTP level, redirects at the HTML level, and the HTTP->HTTPS thing. Resolving all that honestly is annoying, but possible. Sometimes you do need to look at the beginning of a file blocked by "robots.txt" to find that it is redirecting you elsewhere. It's like a door that says both "Keep Out" and "Please Use Other Door".

This is more of a pedantic problem than a real one.

106. mindcr+VN2[view] [source] 2023-07-09 07:22:59
>>skille+(OP)
Next up: "Google to explore alternatives to HTML".

Do you think they developed AMP and are heavily invested in the W3C for "the good of the community"?

And they already tried to "Googlify" cookies earlier.

107. JohnFe+cW6[view] [source] 2023-07-10 16:00:23
>>skille+(OP)
I agree that there needs to be better control over such things. Whatever comes from this, it also needs to avoid the fundamental problem with robots.txt: it's advisory only, and bots can freely ignore it. And many do.

A real solution has to have an effectiveness greater than just asking nicely and hoping that people are honorable.

◧◩
108. JohnFe+jW6[view] [source] [discussion] 2023-07-10 16:01:04
>>Kwpols+ab
> and convince OpenAI and Bing to do the same.

And everyone else as well.

◧◩
109. JohnFe+8X6[view] [source] [discussion] 2023-07-10 16:05:53
>>dbette+u2
I think robots.txt isn't suitable for this for the same reason it's not suitable for keeping other bots from crawling your site: adhering to what robots.txt says is optional, and plenty of bots opt to ignore it.
◧◩◪◨
110. bastaw+tk7[view] [source] [discussion] 2023-07-10 17:32:37
>>2OEH8e+n41
> Now it will fall on every web developer to tag their data for it to be exploited by megacorps?

If Google says they'll delist your site if they detect AI generated content that you haven't declared, that's also a you problem (you meaning webmasters). It's a bit silly to think that it's a purely one way relationship. You're more than welcome to block Google from indexing your site (trivially!) and they're welcome to not include you in their service for not following their guidelines.

[go to top]