In particular, these bits:
> A principled approach to evolving choice and control for web content
> We believe everyone benefits from a vibrant content ecosystem. Key to that is web publishers having choice and control over their content, and opportunities to derive value from participating in the web ecosystem. However, we recognize that existing web publisher controls were developed before new AI and research use cases.
> We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.
That's an awful lot of talk about "choice", and even more so about "evolving" it. That's particularly odd when the choice of most publishers seems rather clear: "don't scrape our content for AI training without at least asking first" - and robots.txt is perfectly capable of expressing that choice.
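For illustration, a plain robots.txt can already express exactly that choice; the AI-crawler user-agent name below is a made-up placeholder, since the real token depends on whatever name a given scraper announces:

```
# Allow ordinary search indexing
User-agent: Googlebot
Allow: /

# Refuse a (hypothetical) AI-training crawler
User-agent: ExampleAITrainingBot
Disallow: /
```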
So the ones unhappy with the available means of choice seem to be the AI scrapers, not the publishers.
So my preliminary translation of this from corpospeak would be:
"Look guys, we were fine with robots.txt as long as we were only scraping your sites for search indexing.
But now the AI race is on and gathering training data has just become Too Important, so we're planning to ignore robots.txt in the near future and just scrape the entirety of your sites for AI training.
Instead, we'll offer you the choice of whether you want to let us scrape in exchange for some yet-to-be-determined compensation, or whether you'll just provide the data for free. If we're particularly nice, we'll also give you an option to opt out of scraping altogether. However, that option will be separate from robots.txt, and you will have to explicitly add it to your site (provided you find out about it in the first place)."
That being said, I find robots.txt a strange target for this. Robots.txt really is nothing: it's not a license, it has no legal significance (afaik), and it never prevented scraping on a technical level either. All it ever did was give friendly scrapers a hint so they don't accidentally step on publishers' toes - it never stopped anyone from intentionally scraping stuff they weren't supposed to.
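To make that concrete, here's a minimal sketch (URLs and the bot name are made up) of why robots.txt is purely advisory: honoring it is a client-side courtesy, and the server never even sees whether the file was consulted.

```python
# Sketch: robots.txt compliance is entirely voluntary on the client side.
import urllib.request
from urllib.robotparser import RobotFileParser

URL = "https://example.com/some-article"
USER_AGENT = "ExampleAITrainingBot"  # hypothetical crawler name

# A friendly scraper voluntarily checks robots.txt before fetching...
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
if rp.can_fetch(USER_AGENT, URL):
    page = urllib.request.urlopen(URL).read()

# ...but nothing enforces that check. An unfriendly scraper just skips it
# and fetches anyway; the server can't tell the difference.
page = urllib.request.urlopen(URL).read()
```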
On the other hand, if some courts did interpret robots.txt as some kind of impromptu license, that interpretation probably wouldn't change, whether Google likes the standard or not. And people who employ real technical measures (rate limiting, captchas, etc.) will probably continue to do so, too.
So if that's what they're planning to do, my only explanation would be that there is a large number of small, "low-hanging fruit" sites (probably with inexperienced devs) that don't want to be scraped but only ever added a robots.txt to block scrapers and did nothing else - and Google is planning to use those for AI training now that the large social networks are increasingly locking it out.