Change for the sake of it?
It's kind of implied: specifying sitemaps/allowances/copyright for different use cases (search, scraping, republishing, training, etc.), and perhaps standardizing some of the non-standard extensions: Crawl-delay, default host; even Sitemap isn't part of the original robots.txt standard.
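For reference, the non-standard extensions mentioned look roughly like this in practice (support varies widely by crawler: Google documents that it ignores Crawl-delay, and Host was a Yandex extension; example.com is a placeholder):

```
User-agent: *
Crawl-delay: 10   # seconds between requests; non-standard, ignored by Google

Host: example.com                          # preferred host; Yandex extension
Sitemap: https://example.com/sitemap.xml   # from the Sitemaps protocol, not RFC 9309
```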
> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.
"Your 'do not enter' sign uses a font we don't like, so we'll just ignore it."
1 - https://support.google.com/maps/answer/1725632?hl=en#zippy=%...
[0] https://developers.google.com/search/docs/crawling-indexing/...
Wow, that's absurd. It would have been better not to have any mechanism at all.
Edit: Alternatively, have a "Harvest" section in "robots.txt", using the same established syntax and semantics. This may come with the advantage of making it clear that agents should default to the general "robots.txt" rules in absence of any such rules. Moreover, existing content management systems will already provide means for maintaining "robots.txt" and there's no need to update those. (We may also introduce an "Index" section for the established purpose of "robots.txt", with any bare, untitled rules defaulting to this, thus providing compatibility.)
Example:
#file "robots.txt"
Index # optional section heading (maybe useful for switching context)
User-agent: *
Allow: /
Disallow: /test/
Disallow: /private/
User-agent: Badbot
Disallow: /
Harvest # additional rules for scraping
User-agent: *
Disallow: /blog/
Disallow: /protected-artwork/

There's no AI involved in web crawling. If you come to my site, I'll tell you which pages you can visit/index, and which pages you can't, end of story.
Yes, there are security concerns with people putting /very-secret-admin-panel in their robots.txt and letting malicious actors know what URLs they should target. But if /very-secret-admin-panel is never linked by any public page, then the bot won't encounter it, therefore this stuff should never belong in robots.txt.
Please keep it as straightforward as this and don't add any AI bullshit to one of the few remaining simple processes in web development and administration.
I prefer the term ‘Chad third-party scraper’ [1]
https://pbs.twimg.com/media/FxkeJmjakAENFI8?format=jpg&name=...
[0] https://developers.google.com/search/docs/crawling-indexing/...
should you get to decide if people can take pictures of your store?
Especially since they're letting stores pay money to be the first recommended store.
Robots.txt exists because shop photographers want to be allowed to take pictures rather than be blocked outright.
I see this argument made over and over again here on HN and it’s puzzling that people always stop at the first part.
Companies won’t stop at the “look at your content” phase. They will use the knowledge gathered by looking at your content to do something else. That’s the problematic part.
(Edit: How is a factual, on-topic statement, providing a source-link for its claim, downvoted? You may not favor these regulations, but they still do exist.)
Retail companies research what other retail companies are doing and copy them all the time... was the answer supposed to be no here?
I find this debate very aligned to copyright debates.
They want to introduce a line in robots.txt that says "not for training AI", so nobody else can use public data to train their AI. They already did.
The value of a store is the ability to buy products from it; you taking a photo of it doesn't impact that transaction of value at all. The value of content online is the very act of reading it/consuming it.
A scraper is getting a free lunch, that is clear. They are trading nothing for something, and as the owner of the something that is not the deal I have chosen to make.
The business has the right to ask you to leave if you violate their policies. In fact, they can ask you to leave for (almost) any reason at all. They may have some limited right to remove you using a reasonable amount of force, depending on the jurisdiction.
Once you've left or been removed from their property, you still have the legal right to take photos of it from the public place you're now standing in. If you can view the products they're selling through their window, you can keep taking photos of them.
They don't have the right to confiscate your camera or the pictures you took. Your rights in terms of what you can do with those photos may have limitations (e.g. redistribution, reproduction), particularly if you photographed copyrighted works.
This is why the parent's comment confused me so much. In most of the world you live in a society where yeah you have the freedom to take photos of stuff, or copy it down on a clipboard or whatever, and use it as competitive intelligence to improve your own business. And thousands of businesses are doing it every day.
It becomes an integral part of a business product. That is the problematic part.
You going into a store and taking pictures of some art to use as reference material is not an issue.
But if you take those pictures and use them to make a program that then spits out new art that is just a mix of those images patched together, then, imo, that's an issue.
Of course it's OK to take note of what stock is on a store's shelf, go back to your own business, and sell the same stock. It's also ubiquitous. It is de facto practiced globally by everyone, it's generally legal, and it's morally fine. Broadly speaking we call this competitive intelligence or market intelligence.
The intent of robots.txt is to help crawlers, for example, to keep crawlers from getting stuck in a recursive loop of dynamic pages, or from crawling pages with no value. robots.txt is not for banning, restricting, or hindering crawlers.
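As a concrete illustration of that cooperative intent: a well-behaved crawler consults robots.txt before every fetch, and Python's standard library even ships a parser. A minimal sketch, with the rules, bot names, and example.com all made up for illustration:

```python
from urllib import robotparser

# Hypothetical rules, parsed from a string instead of fetched over HTTP.
rules = """\
User-agent: *
Disallow: /private/

User-agent: Badbot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A polite crawler checks before fetching each URL.
print(rp.can_fetch("Goodbot", "https://example.com/blog/post"))  # True
print(rp.can_fetch("Goodbot", "https://example.com/private/x"))  # False
print(rp.can_fetch("Badbot", "https://example.com/blog/post"))   # False
```

Note the honor-system nature of this: nothing stops a crawler from simply never calling `can_fetch`.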
The source content is part of the AI product. There is no AI product without the source content.
This is not you going to a store, seeing what they sell, and adjusting your offering. You have no offering without the original store's content.
I think it's almost a guarantee that courts will start finding exact AI reproductions of copyrighted work to be infringement.
Where the analogy might come into play is that if you take a photo of a copyrighted work there are limitations on what you can do with your photo, without infringing on that copyright. I have no idea if the courts will apply that stuff to AI, for instance there's actually a fair bit of leeway if you take a photo which contains only a portion of a copyrighted work and then you want to sell or redistribute that photo. One might argue that this legal principle applies to AI as well... lawyers are already having a field day with this stuff I'm sure.
They aren't copying the content. They are learning from the content and producing more like it, but not a copy.
But when people do that, it is allowed, isn't it? So what is special about AI, other than the scale?
In particular those bits:
> A principled approach to evolving choice and control for web content
> We believe everyone benefits from a vibrant content ecosystem. Key to that is web publishers having choice and control over their content, and opportunities to derive value from participating in the web ecosystem. However, we recognize that existing web publisher controls were developed before new AI and research use cases.
> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.
That's an awful lot of talk about "choice", and even more so "evolving choice". That's particularly odd when the choice of most publishers seems to be rather clear: "don't scrape content for AI training without at least asking first" - and robots.txt is perfectly capable of expressing that choice.
So the ones that seem unhappy with the available means of choice seem to be the AI scrapers, not the publishers.
So my preliminary translation of this from corpospeak would be:
"Look guys, we were fine with robots.txt as long as we were only scraping your sites for search indexing.
But now the AI race is on and gathering training data has just become Too Important, so we're planning to ignore robots.txt in the near future and just scrape the entirety of your sites for AI training.
Instead, we'll offer you the choice of whether you want to let us scrape in exchange for some yet-to-be-determined compensation or whether you just provide the data for free. If we're particularly nice, we'll also give you an option to opt-out of scraping altogether. However, this option will be separate from robots.txt and you will have to explicitly add it to your site (provided you get to know about it in the first place)"
That being said, I find robots.txt a bit strange for a target for this. Robots.txt really is nothing - it's not a license and has no legal significance (afaik) and it never prevented scraping on a technical level either. All it did was give friendly scrapers a hint, so they don't accidentally step on the publisher's toes - but it never prevented anyone from intentionally scraping stuff they weren't supposed to.
On the other hand, if some courts did interpret robots.txt as some kind of impromptu licence, that interpretation probably wouldn't change, whether Google likes the standard or not. Also, people who employ real technical measures (ratelimiting, captchas, etc.) will probably continue to do so, too.
So if that's what they're planning to do, my only explanation would be that there is a large amount of small and "low-hanging fruit" sites (probably with inexperienced devs) that don't want to be scraped but really only added a robots.txt to block scrapers and didn't do anything else - and Google is planning to use those for AI training when all the large social networks are increasingly blocking them off now.
AI is software, it doesn't "learn" as a human does, and even if it did, it would still have to be bound by the same rules as any other piece of software or human.
Exactly, so there's zero reason to prevent anyone from using a piece of software (which slurps a lot of information off the internet) to produce new works that do not infringe on currently copyrighted content.
I'm honestly surprised they're required to abstain from doing so at the author's request.
You can only read the context of the match after finding the search result after all, not the whole book.
It's an example of significant overreach of intellectual property from how I see it. The robots.txt rationale doesn't apply there either, as their scanning does not impact anyone's resources. And it's been published, which makes it public by definition.
I think Google is probably thinking hard about the problem of training AI: you don't want to train on the output of other AI. That doesn't mean the content shouldn't be processed, just that it shouldn't be used for training. Or maybe it's worth noting that some content is derived from other content that you've manually produced, versus content derived from the content of third parties.
Said another way, I expect that Google isn't just implementing a new allowlist/denylist. It's likely about exposing new information about content.
There is not much point in giving crawlers a lot of generated content; rather, give them only the succinct "prompt".
That way, it is easy to signal to crawlers what to crawl, and the user can read the full content after the LLM has generated it...
That was never not true. The difference is that AI can't violate copyright, only humans can. The legal not-so-gray area is whether "spat out by an AI after prompting" is a performance of the work and if so, what human is responsible for the copying.
That's not to say that I disagree. In most cases robots.txt is not legally binding. It only becomes a legal danger to not follow it when the person running the site has power and can buy a DA to indict you.
Which is of course not the real reason.
The reason Google doesn't follow the robots.txt protocol is (1) they don't want to (2) they can get away with it.
Now that I think of it- why do we put up with robots.txt at all?
> A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests
If someone overloads your site with automated requests how is that not criminal? Why aren't they liable?
>Disallow: /comment
So I guess that works for them.
The exceptions will be like, pictures of a specific city's skyline. Not because it's copying a particular image, but because that's what that city's skyline looks like, so that's how it looks in an arbitrary picture of it. But those are the pictures that lack original creativity to begin with -- which is why the pictures in the training data are all the same and so is the output.
And people seem to make a lot of the fact that it will often reproduce watermarks, but the reason it does that isn't that it's copying a specific image. It's that there are a large number of images of that subject with that watermark. So even though it's not copying any of them in particular, it's been trained that pictures of that subject tend to have that watermark.
Obviously lawyers are going to have a field day with this, because this is at the center of an existing problem with copyright law. The traditional way you show copying is similarity (and access). Which no longer really means anything because you now have databases of billions of works, which are public (so everyone has access), and computers that can efficiently process them all to find the existing work which is most similar to any new one. And if you put those two works next to each other they're going to look similar to a human because it's the 99.9999999th percentile nearest match from a database of a billion images, regardless of whether the new one was actually generated from the existing one. It's the same reason YouTube Content ID has false positives -- except that its database only includes major Hollywood productions. A large image database would have orders of magnitude more.
Maybe they want to have finer details on page content, e.g.: "you can index those pages but not those nodes" or "those nodes are also AI generated, please ignore".
Otherwise I don't know, robots.txt is not sexy but definitely does the job.
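Node-level control does already exist in a limited form: Google honors a data-nosnippet attribute that excludes parts of a page from search snippets. One could imagine (purely hypothetically, data-noai is not a real standard) an analogous attribute for AI use:

```html
<p>This paragraph may be indexed and shown in snippets.</p>

<!-- Real: Google won't quote this section in search result snippets. -->
<section data-nosnippet>
  Internal notes we don't want quoted in results.
</section>

<!-- Hypothetical: a node-level AI opt-out, invented here for illustration. -->
<section data-noai>
  This text is AI generated; please don't train on it.
</section>
```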
Gone will be revenue sharing, gone will be users visiting other sites.
The goal is for Google to keep ALL the revenue, for content written by others.
Hope that works out for them. I have already taken down over 300 articles written on networking, Linux, FreeBSD, Wireguard, DSP, software defined radios. I am not feeding a machine that steals my writing, even though I never explicitly expected payment from the viewer.
They do (or did). They showed the text around the search term, around a page or so, which made it possible to reconstruct the whole book without that much effort.
> I accept Google's Terms and Conditions and acknowledge that my information will be used in accordance with Google's Privacy Policy.
Why is this being done through a Google mailing list? Why does Google want any public participation anyway? They usually just implement their new gee-whizz scheme, and start strong-arming web publishers into using it.
Like, why would I trust a process that is run by Google, to create a new mechanism for controlling search engine behaviour? Fox: meet henhouse.
Nowadays most blog posts in the SERPs are full of spam and unnecessary filler text. I stopped clicking on random blogs because of how awful they’ve become. I’m currently using Bing Chat (which uses GPT-4 under the hood) and it saves me a lot of time.
The deal with searchbots is that you allow indexing because you want to be found. But no such quid-pro-quo occurs when the content is just fed into the maw of an AI trainer.
Criminal requires a specific law in the criminal code be intentionally broken.
There is a world of difference between an intentional DoS and a crawler adding some marginal traffic to a server then backing off when the server responses fail.
Allow: /foo
Disallow: /bar
Consider the situation where /foo HTTP 301s to /bar, or 200s but with a canonical location header that is /bar. Do you follow the redirect? Do you index /foo?

In practice it's also often a directory of the paths the website owners don't want eyes to look at. Pretty common to find a list of uncomfortable content, especially on larger websites... like that time the dean of the college praised the philanthropy of Boko Haram. Real OSINT footgun.
> >>35888037 : security.txt, carbon.txt, SPDX SBOM, OSV, JSON-LD, blockcerts
"Google will label fake images created with its A.I." (re: IPTC, Schema.org JSON-LD) (2023) >>35896000
From "Tell HN: We should start to add “ai.txt” as we do for “robots.txt”" (2023) >>35888037 :
> How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
Speaking of this and other cases of trying to punish someone for every iteration of a for loop - I wonder if the result would be the same if the accused drove actual browser to click stuff in a for loop, vs. using curl directly. I imagine the same, but then...
... what if they paid N people some token amount of money, to have each of those people do one step of the loop and send them the result? Is executing a for loop entirely or in part on the human substrate, instead of in silico, seen as abuse under the CFAA?
(I have a feeling that it might not be - there's lots of jobs online and offline that involve one company paying lots of people some money for gathering information from their competitors, in a way the latter very much don't like.)
It's getting annoying.
There's nothing wrong with robots.txt. Don't change what works just because you, Google developers, have to justify your employment.
> The issue is using copyrighted content without consent
The consent is given implicitly if the content is available to the public for viewing. The copyright isn't being violated by an AI training model, as the content isn't copied. The information contained within the works is not what's being copyrighted - it's the expression.
If the AI training algorithm is capable of extracting the information out of the works and using it in another environment as part of some other works, you cannot claim copyright over such information.
This applies to style, patterns and other abstract information that could be extracted from works. It's as if a chef, upon reading many recipe books, produces a new recipe book (that contains information extracted from them) - the original creators of those recipe books cannot claim said chef had violated any copyright.
Yes, although that's not what people are usually worried about.
I once tried to deal with that in Sitetruth's crawler. There are redirects at the HTTP level, redirects at the HTML level, and the HTTP->HTTPS thing. Resolving all that honestly is annoying, but possible. Sometimes you do need to look at the beginning of a file blocked by "robots.txt" to find that it is redirecting you elsewhere. It's like a door that says both "Keep Out" and "Please Use Other Door".
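One defensible policy for the redirect case (a sketch under assumptions, not how Sitetruth actually resolves it; the rules, bot name, and URLs are invented) is to re-check robots.txt at every hop of the chain and only proceed if every hop is fetchable:

```python
from urllib import robotparser

# Invented rules: /bar is off-limits, /foo is not.
rules = """\
User-agent: *
Disallow: /bar
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

def allowed_after_redirects(rp, agent, chain):
    # Re-check robots rules at every hop: a redirect from an allowed
    # URL must not become a back door into a disallowed one.
    return all(rp.can_fetch(agent, url) for url in chain)

# /foo is allowed on its own, but it 301s to the disallowed /bar:
print(allowed_after_redirects(
    rp, "Goodbot",
    ["https://example.com/foo", "https://example.com/bar"]))  # False
```

Whether you then still index /foo (the door that said "Please Use Other Door") remains a judgment call the protocol itself doesn't answer.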
This is more of a pedantic problem than a real one.
Do you think they developed AMP and are heavily invested in the W3C for "the good of the community"?
And they already tried to "Googlify" cookies earlier.
A real solution has to have an effectiveness greater than just asking nicely and hoping that people are honorable.
If Google says they'll delist your site if they detect AI generated content that you haven't declared, that's also a you problem (you meaning webmasters). It's a bit silly to think that it's a purely one way relationship. You're more than welcome to block Google from indexing your site (trivially!) and they're welcome to not include you in their service for not following their guidelines.