zlacker

[parent] [thread] 18 comments
1. voytec+(OP)[view] [source] 2023-07-08 06:20:02
Seems like it's intended for content stealing from every place that doesn't immediately implement Google's New Web Order as an addition to robots.txt.

"Your do-not-enter sign uses a font we don't like, so we'll just ignore it"

replies(3): >>saagar+Y3 >>Ferret+kd >>LinuxB+Zy
2. saagar+Y3[view] [source] 2023-07-08 07:10:36
>>voytec+(OP)
What makes you think this? Why do you think Google actually cares about your sign if all they want to do is steal from you?
replies(1): >>oneeye+q6
◧◩
3. oneeye+q6[view] [source] [discussion] 2023-07-08 07:42:20
>>saagar+Y3
IIRC, Google has precedent on this - e.g. scanning full books for search unless the owner explicitly refused.
replies(2): >>dylan6+e8 >>411111+Hl
◧◩◪
4. dylan6+e8[view] [source] [discussion] 2023-07-08 08:02:30
>>oneeye+q6
They are the ultimate "ask for forgiveness rather than permission" company. Copyright had been a thing for a long, long time before Google developed their scanning. They were well aware that it should have been opt-in, but knew they'd never gain traction for their little project. So they bull-in-a-china-shop'd their way to the point where it was too late to stop them.
replies(3): >>philip+tG >>remus+KP >>extra8+w12
5. Ferret+kd[view] [source] 2023-07-08 09:04:59
>>voytec+(OP)
To be clear, robots.txt is not legally binding, Google is not bound to follow it, and in fact I believe that Google doesn't follow it and hasn't for a very long time, for the simple reason that many sites' robots.txt file is wrong.

The intent of robots.txt is to help crawlers, for example, to keep crawlers from getting stuck in a recursive loop of dynamic pages, or from crawling pages with no value. robots.txt is not for banning, restricting, or hindering crawlers.
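
A minimal sketch of how a cooperating crawler might consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot`, and the example.com URLs are all made up for illustration; nothing here reflects how any particular search engine actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
# The disallowed paths are the kind of endless dynamic pages a crawler
# could otherwise get stuck in.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /calendar/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An ordinary article page is not matched by any Disallow rule.
article_allowed = parser.can_fetch("ExampleBot", "https://example.com/articles/1")

# A dynamically generated calendar page falls under "Disallow: /calendar/".
calendar_allowed = parser.can_fetch("ExampleBot", "https://example.com/calendar/2023/07")

print(article_allowed, calendar_allowed)
```

Note that the check is purely advisory: the crawler asks and then decides for itself whether to honor the answer.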

replies(3): >>superk+zO >>lisasa+LZ >>floomk+R31
◧◩◪
6. 411111+Hl[view] [source] [discussion] 2023-07-08 10:48:44
>>oneeye+q6
Your phrasing makes it sound like that's a negative.

I'm honestly surprised they're required to abstain from doing so at the author's request.

You can only read the context of the match after finding the search result after all, not the whole book.

It's an example of significant overreach of intellectual property, as I see it. The robots.txt rationale doesn't apply there either, as their scanning does not impact anyone's resources. And it's been published, which makes it public by definition.

replies(1): >>oneeye+Tt
◧◩◪◨
7. oneeye+Tt[view] [source] [discussion] 2023-07-08 12:24:28
>>411111+Hl
Oh, I agree with you. I think the whole idea of legislating against machines accessing public content is a very slippery slope.
8. LinuxB+Zy[view] [source] 2023-07-08 13:08:51
>>voytec+(OP)
AFAIK the only way to reduce content stealing by bots is to add authentication requirements to a page, detect when a real person's authentication is being shared with bots, and then instantly and automatically rotate their password each time that occurs.
◧◩◪◨
9. philip+tG[view] [source] [discussion] 2023-07-08 14:00:50
>>dylan6+e8
They don't even ask for forgiveness. They are "don't admit you've done anything wrong to begin with."
◧◩
10. superk+zO[view] [source] [discussion] 2023-07-08 14:55:04
>>Ferret+kd
That's just because Google is a corporate person who is more equal than a human person. Human persons, at least in the USA, get charged under the CFAA (18 U.S.C. § 1030) if they're using non-browser tools to access the public website of someone with power and/or if they happen to rock the boat (like weev w/wget).

That's not to say that I disagree. In most cases robots.txt is not legally binding. It only becomes a legal danger to not follow it when the person running the site has power and can buy a DA to indict you.

replies(2): >>rafark+Qj1 >>TeMPOr+IW1
◧◩◪◨
11. remus+KP[view] [source] [discussion] 2023-07-08 15:02:55
>>dylan6+e8
Copyright is to do with protecting reproduction of works, no? What Google has done here is scan the books and index the content, presumably to make it easier for users to search books for relevant material. Assuming they don't reproduce large sections of copyrighted works in their search results, I don't feel like they're doing anything wrong here.
replies(1): >>tpxl+Td1
◧◩
12. lisasa+LZ[view] [source] [discussion] 2023-07-08 16:05:45
>>Ferret+kd
> for the simple reason that many sites' robots.txt file is wrong.

Which is of course not the real reason.

The reason Google doesn't follow the robots.txt protocol is that (1) they don't want to and (2) they can get away with it.

◧◩
13. floomk+R31[view] [source] [discussion] 2023-07-08 16:28:54
>>Ferret+kd
In the EU, they are. If something was not meant to be accessible, you may not scrape it.
◧◩◪◨⬒
14. tpxl+Td1[view] [source] [discussion] 2023-07-08 17:27:00
>>remus+KP
> Assuming they don't reproduce large sections of copyrighted works

They do (or did). They showed the text around the search term, around a page or so, which made it possible to reconstruct the whole book without that much effort.

◧◩◪
15. rafark+Qj1[view] [source] [discussion] 2023-07-08 17:57:26
>>superk+zO
If a tool can access a URL, does that not make it a browser?
replies(1): >>TeMPOr+1W1
◧◩◪◨
16. TeMPOr+1W1[view] [source] [discussion] 2023-07-08 22:10:36
>>rafark+Qj1
Not under any but the narrowest of meanings, i.e. "can follow URLs / can talk HTTP". By itself, it's not a browser to users, it's not a browser to software developers, and it's definitely not a browser to lawyers and judges.
replies(1): >>rafark+h52
◧◩◪
17. TeMPOr+IW1[view] [source] [discussion] 2023-07-08 22:16:57
>>superk+zO
> like weev w/wget

Speaking of this and other cases of trying to punish someone for every iteration of a for loop - I wonder if the result would be the same if the accused drove an actual browser to click stuff in a for loop, vs. using curl directly. I imagine the same, but then...

... what if they paid N people some token amount of money, to have each of those people do one step of the loop and send them the result? Is executing a for loop entirely or in part on the human substrate, instead of in silico, seen as abuse under the CFAA?

(I have a feeling that it might not be - there's lots of jobs online and offline that involve one company paying lots of people some money for gathering information from their competitors, in a way the latter very much don't like.)

◧◩◪◨
18. extra8+w12[view] [source] [discussion] 2023-07-08 23:00:28
>>dylan6+e8
Yet they keep getting sued and keep winning in the courts, at least in the U.S. Seems like they have a pretty good grasp of how the laws work.
◧◩◪◨⬒
19. rafark+h52[view] [source] [discussion] 2023-07-08 23:40:43
>>TeMPOr+1W1
Is there a legal definition of a web browser though? I think it’s an interesting topic.