zlacker

[parent] [thread] 10 comments
1. tyingq+(OP)[view] [source] 2021-04-15 14:11:37
I wonder at what point Google needs to pay attention to this. If a large number of websites are behind registration or paywalls, then the chance that any individual person searching would have a subscription (or registration) to any individual news site is pretty low. People might have some small number of subscriptions, but not for many sites.

So, why keep returning search results that the end user can't use without registration or purchase? It's essentially "page cloaking" when the rendered page doesn't match what Google sees.

To me, if you want a paywall, that should come with the consequence that your site isn't included in search results for the general public.

Edit: It's also getting irritating here on HN. I might have a subscription or login to one or two sites, but HN regularly shares stuff from Medium, WSJ, NYT, Wired, and so on. I have to imagine that most people following these posted stories hit the reg/pay-wall.

replies(5): >>ameliu+61 >>little+N1 >>baby-y+12 >>Vespas+G3 >>dazc+Fa
2. ameliu+61[view] [source] 2021-04-15 14:16:58
>>tyingq+(OP)
Perhaps Google can get info on how many article-views the user has left, so they can take that into account when returning search results.
3. little+N1[view] [source] 2021-04-15 14:21:14
>>tyingq+(OP)
I've been wanting a "Show only free/non-subscription results" checkbox in Google (and other search engines) for a while, but I don't think it will ever happen.
4. baby-y+12[view] [source] 2021-04-15 14:22:36
>>tyingq+(OP)
Could not agree with this more, and this has gone on far too long.

Why does Google allow this? As you say, it is 100% cloaking to have the entire article indexed but then not present it on the page the searcher lands on.

Sure, publishers feel they need paywalls for revenue purposes; have at it. That should not absolve them from the "rules" everyone else has to follow.

Cloaking refers to the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google's Webmaster Guidelines because it provides our users with different results than they expected. [0]

[0] - https://developers.google.com/search/docs/advanced/guideline...

replies(2): >>leephi+9b >>Apollo+by
5. Vespas+G3[view] [source] 2021-04-15 14:30:03
>>tyingq+(OP)
I'm also curious whether Google's core business will be affected.

It stands to reason that a subscription model reduces the dependency on ads and strengthens the negotiation position of publishers.

I'm pretty sure Google has made an art form out of being a rent seeking middle man.

6. dazc+Fa[view] [source] 2021-04-15 15:08:40
>>tyingq+(OP)
Google used to have a policy of banning cloaked sites but they also don't want users to search for 'New York Times' and get some result that is not 'The New York Times'.

So that rule seems to go out of the window for this reason?

7. leephi+9b[view] [source] [discussion] 2021-04-15 15:10:55
>>baby-y+12
If they are cloaking, can we get around the paywall by using the Google crawler user agent string?
replies(2): >>tyingq+vc >>gpm+5j
8. tyingq+vc[view] [source] [discussion] 2021-04-15 15:18:32
>>leephi+9b
That is one of many workarounds the various paywall-buster browser extensions use: setting either the Google crawler user-agent or the Google AdsBot agent. I would guess you would also need to not send cookies. Sites could also be clever and check that your IP/netblock is a Google-owned one.
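As a minimal sketch of the user-agent trick described above (the Googlebot UA string is Google's documented one; the target URL is a placeholder, and real sites may still block this server-side):

```python
import urllib.request

# Googlebot's documented desktop user-agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

def request_as_googlebot(url: str) -> urllib.request.Request:
    """Build a request that presents the Googlebot user-agent and,
    deliberately, sends no cookies -- the two tricks mentioned above."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

req = request_as_googlebot("https://example.com/article")
```

Paywall-buster extensions do essentially this at the browser level, rewriting the User-Agent header and stripping cookies before the request goes out.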
9. gpm+5j[view] [source] [discussion] 2021-04-15 15:46:59
>>leephi+9b
Huh, I thought they published a range of IP addresses to prevent this, but apparently the ranges aren't entirely consistent, and you need to do a reverse DNS lookup [1] to actually check whether something is Google's crawler. I'm willing to bet most organizations aren't doing that... so maybe.

[1] https://developers.google.com/search/docs/advanced/crawling/...

replies(1): >>leephi+jP
10. Apollo+by[view] [source] [discussion] 2021-04-15 16:42:41
>>baby-y+12
Pretty sure it's fear of it being added to an antitrust complaint.

It is really frustrating as a user, and Google undoubtedly knows this. So an impending lawsuit is the only reason I can see for them not delisting NYT and other sites that do this.

11. leephi+jP[view] [source] [discussion] 2021-04-15 17:55:28
>>gpm+5j
I just installed a user agent switcher and tried it on a prominent financial news site. The offer to subscribe was replaced by the article when I reloaded using the Googlebot user agent.