zlacker

You’re easy to track even when your data has been anonymized

submitted by SkyMar+(OP) on 2019-11-02 16:15:18 | 43 points 18 comments
replies(8): >>jandre+x4 >>SlowRo+88 >>JimmyR+m8 >>brende+3e >>aiiane+Fe >>gwrigh+9f >>dictum+bf >>merric+Ih
1. jandre+x4[view] [source] 2019-11-02 17:02:40
>>SkyMar+(OP)
It is worse than anonymized data: you are easy to track in databases that are not storing data about you at all, only measuring the broader environment you operate in (typically for innocuous purposes). This data is inherently "anonymous" in that it was never designed to be associated with or track a person but you can nonetheless reconstruct identity and other information about people from this data.

Individual anonymity in a technical sense is impossible in an environment with network connected sensors. Above a critical mass of sensors, which we have far exceeded in most urbanized areas, there are no technical measures that can keep a person from being tracked.

replies(1): >>SlowRo+q7
◧◩
2. SlowRo+q7[view] [source] [discussion] 2019-11-02 17:35:47
>>jandre+x4
Example of what you are talking about specifically?
replies(1): >>jandre+Ia
3. SlowRo+88[view] [source] 2019-11-02 17:43:21
>>SkyMar+(OP)
I’m not sure if I’m missing the point. Going by the reference link, I think they’re saying there is a 75-85% chance you are the only person in your zip code with your gender and your birthdate.

This does not seem that surprising, nor does it seem like a new technological development.
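
For anyone who wants to check that kind of figure against their own data, here is a minimal sketch of the calculation: count how many records have a (zip, gender, birthdate) combination that nobody else shares. The field names and the toy records are made up for illustration and do not come from the article or the paper.

```python
from collections import Counter

def unique_fraction(records, keys=("zip", "gender", "birthdate")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)

# Toy data, invented for this example.
people = [
    {"zip": "02139", "gender": "F", "birthdate": "1980-03-14"},
    {"zip": "02139", "gender": "F", "birthdate": "1980-03-14"},
    {"zip": "02139", "gender": "M", "birthdate": "1975-07-01"},
    {"zip": "94110", "gender": "F", "birthdate": "1990-11-23"},
]

print(unique_fraction(people))  # 0.5: two of the four records are unique
```

Every record counted as unique here could be matched to a named person by anyone holding a second dataset with the same three fields.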

replies(1): >>EsssM7+ui
4. JimmyR+m8[view] [source] 2019-11-02 17:44:57
>>SkyMar+(OP)
Take web logs, for example. What does "anonymized" mean? Even when the datasets get "anonymized", information leaks at scale.

Some of those IP addresses are static IP addresses; how could you tell whose?

Some of those URLs may end up with some form of PII in the query string from some forgotten backend service.

Some of those user agents ended up carrying a Kaspersky unique identifier.

Someone saves your website, and when they open it some tag re-fires, capturing that they opened it from /Users/John.Smith/yourpage.html.

Facebook, Google, and others add link-decoration tracking to the URL, and suddenly a unique identifier appears across various hits, even if it wasn't added by the site owner.

There may be account identifiers, hashes, or tokens linked to emails or phone numbers. So while the log dataset is marked as low risk if lost, losing some of the mapping tables at any point in the future would turn it into full PII.
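
As a rough illustration of the query-string problem only, here is a sketch that keeps an allow-list of parameters and drops everything else before a URL is logged. The parameter names in `SAFE_PARAMS` and the example URL are hypothetical, and this does nothing about IP addresses, user agents, or mapping tables.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical allow-list: anything not named here is dropped.
SAFE_PARAMS = {"page", "lang"}

def scrub_url(url):
    """Return `url` with non-allow-listed query parameters and the fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in SAFE_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

raw = "https://example.com/a?page=2&fbclid=abc123&email=j%40x.org#frag"
print(scrub_url(raw))  # https://example.com/a?page=2
```

An allow-list is used rather than a block-list because the identifiers described above tend to arrive under parameter names nobody anticipated.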

◧◩◪
5. jandre+Ia[view] [source] [discussion] 2019-11-02 18:12:39
>>SlowRo+q7
When you move through any environment, it leaves a discernible trace in a (to most people) surprisingly large swath of sensor platforms that exist solely for boring industrial purposes like building management, measuring the operating environment of equipment to improve efficiency, etc, never mind systems actually designed to indirectly capture people (like cameras). Your existence perturbs the environment and leaves a faint footprint in the data. As a trivial example, transient proximity of people creates small fluctuations in measured temperatures. There are analytical techniques that reliably and systematically isolate and amplify those traces so that you can fingerprint and track a person using them. The typical urbanized environment is littered with these sensors and it has been repeatedly demonstrated that the measurements coming off these sensors can be used to constructively identify specific individuals in the environment.

I think the gap for most people is not the existence of these sensors, which capture nothing about a person in any kind of direct way, or that people perturb their environment in some abstract way, but the existence of analytic techniques that allow someone to reconstruct detailed personal information from large collections of extremely oblique measurements of the broader environment.

The analytic methods for doing this type of reconstruction are quite clever and non-obvious, which I guess would need to be the case for it to be surprising. It is nothing at all like typical web or enterprise analytics -- you are using measured physics and constraints on that physics to infer environmental dynamics that you can't measure directly.

replies(1): >>pacala+De
6. brende+3e[view] [source] 2019-11-02 18:55:24
>>SkyMar+(OP)
The only way to properly anonymize data is to aggregate it in such a way that you can't undo the aggregation. Anonymized data can nearly always be de-anonymized if you have either a) sufficient volume of data or b) access to the raw non-anonymized source data.

The problem is that most data surveillance systems store the raw source data instead of just keeping metrics in aggregate form. Thus, it's almost always possible to de-anonymize data.
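
A minimal sketch of what keeping metrics only in aggregate form could look like: count records per group and discard any group smaller than a threshold, then throw the raw rows away. The threshold of 5 and the field name are assumptions for illustration, and small-cell suppression like this is a sketch of the idea, not a privacy guarantee.

```python
from collections import Counter

MIN_CELL = 5  # hypothetical threshold; pick one suited to the data

def aggregate(records, key, min_cell=MIN_CELL):
    """Count records per value of `key`, dropping groups smaller than `min_cell`."""
    counts = Counter(r[key] for r in records)
    return {value: n for value, n in counts.items() if n >= min_cell}

# Toy data, invented for this example.
rows = [{"zip": "02139"}] * 6 + [{"zip": "94110"}] * 2

print(aggregate(rows, "zip"))  # {'02139': 6}
```

Once only the counts are stored, there is no per-person row left to re-identify, which is the property that simply stripping identifiers from raw data does not give you.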

replies(1): >>icebra+qj
◧◩◪◨
7. pacala+De[view] [source] [discussion] 2019-11-02 19:02:57
>>jandre+Ia
The other comprehension gap is that, thanks to Moore's law, these methods can be deployed at scale. Everyone is now a target, 24/7. In the good old days of the 20th century and Bond movies, it took a highly paid analyst to target someone personally, which economically limited the intrusion to a tiny sliver of the population.
8. aiiane+Fe[view] [source] 2019-11-02 19:03:15
>>SkyMar+(OP)
This is yet another article that conflates de-identified data with anonymization.

There are ways to create anonymous datasets, but they generally involve aggregation, not just removing the identifiers.

It's unfortunate that the general layperson's understanding of the concepts at work here doesn't tend to extend to this distinction. It would help drive privacy conversations if this were more commonly understood.

9. gwrigh+9f[view] [source] 2019-11-02 19:07:47
>>SkyMar+(OP)
From the article:

> It isn’t all bad news. These same reidentification techniques were used by journalists working at the New York Times earlier this year to expose Donald Trump’s tax returns from 1985 to 1994.

Flippant comments like that make it hard to take the authors seriously. Their concern for privacy apparently evaporates when the techniques are applied against people they don't like.

replies(2): >>icebra+Kj >>pmoria+7H
10. dictum+bf[view] [source] 2019-11-02 19:07:55
>>SkyMar+(OP)
For many Internet related phenomena, you can find an example where it already happened a long time ago with AOL.

https://en.m.wikipedia.org/wiki/AOL_search_data_leak

11. merric+Ih[view] [source] 2019-11-02 19:35:01
>>SkyMar+(OP)
Different article about same study discussed here 3 months ago:

https://news.ycombinator.com/item?id=20513521 (261 points/94 comments)

◧◩
12. EsssM7+ui[view] [source] [discussion] 2019-11-02 19:42:28
>>SlowRo+88
The Nature paper is about the use of ML to augment regular statistical methods with a higher degree of match quality. This is because the neural network is able to identify dataset features beyond mere obvious group intersects.
◧◩
13. icebra+qj[view] [source] [discussion] 2019-11-02 19:50:46
>>brende+3e
Yes. For example, the GDPR distinguishes full datasets where identifiers have been removed (calling it "pseudonymisation") from truly anonymized data. The former is still subject to the regulation, just like any other personal data.
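
A small sketch of why removing or replacing identifiers falls short: if emails are swapped for unsalted SHA-256 hashes, anyone with a list of candidate emails can hash them and match. The addresses and function names below are invented for illustration.

```python
import hashlib

def pseudonymize(email):
    """Replace an email with its unsalted SHA-256 hex digest."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

def reidentify(tokens, candidate_emails):
    """Map released tokens back to emails by hashing a list of known candidates."""
    lookup = {pseudonymize(e): e for e in candidate_emails}
    return {t: lookup[t] for t in tokens if t in lookup}

# "Released" dataset: identifiers replaced by hashes.
released = [pseudonymize("alice@example.com"), pseudonymize("bob@example.com")]

# An attacker's list of known addresses, e.g. from an unrelated leak.
known = ["carol@example.com", "alice@example.com"]

matches = reidentify(released, known)
print(list(matches.values()))  # ['alice@example.com']
```

The hashed dataset looks anonymous on its own, but it stays linkable to a person as soon as a second source supplies the candidate values.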
◧◩
14. icebra+Kj[view] [source] [discussion] 2019-11-02 19:53:23
>>gwrigh+9f
I can be against the means while acknowledging that they may occasionally produce good results. If they were advocating for tracking only people they don't like, then you'd have a point.
◧◩
15. pmoria+7H[view] [source] [discussion] 2019-11-03 00:26:46
>>gwrigh+9f
The argument has been made that the public deserves access to every President's and Presidential candidate's tax records, so they can make informed decisions about whether to elect or re-elect them.

Things such as conflicts of interest, crimes, and lies about where/how they got their money and whether they're really as wealthy as they claim to be, whether they've cheated on their taxes or paid unfairly low taxes considering their enormous wealth are all things that could influence these critically important decisions on the part of the public.

A further argument is that officials serving in public office don't have the same expectation of privacy that private citizens do.

In view of these two arguments and others, it's not difficult to see why the authors of this article might consider the revelation of Trump's tax returns a good thing for reasons other than merely disliking him.

Further, there is no evidence in the article that its authors would not be concerned about the privacy rights of other people they don't like who are neither 1 - the President of the US, nor 2 - public officials.

replies(1): >>gwrigh+bS
◧◩◪
16. gwrigh+bS[view] [source] [discussion] 2019-11-03 02:56:36
>>pmoria+7H
I'm not unaware of the argument that candidates should release their tax returns, but it is not the law right now.

Up until the point that access to Trump's tax returns was mentioned the article was warning about the false privacy associated with anonymizing identity.

I can understand the argument that candidates should reveal their financial history. But that doesn't mean otherwise reasonable concerns about false anonymity should be suspended when talking about the anonymity of one particular person who has explicitly asserted their privacy rights.

Even if you think the authors were making a more general statement about all candidates and not just Trump, that seems like a terrible argument to me. In the cases of candidates for office, voters are free to penalize candidates who don't reveal enough information about themselves by not voting for them. There is no need to soften any privacy concerns about anonymized identities.

replies(1): >>pmoria+nX
◧◩◪◨
17. pmoria+nX[view] [source] [discussion] 2019-11-03 04:15:07
>>gwrigh+bS
"voters are free to penalize candidates who don't reveal enough information about themselves by not voting for them"

Compare these two hypothetical scenarios:

1 - Voters don't have access to the candidate's tax records

2 - Due to released tax records, the voters know for certain all of the below facts about the candidate: A - The candidate paid no taxes, B - The candidate cheated on their taxes, C - The candidate is not as rich as they claim to be, D - The candidate's businesses lost money so they're not as good a businessperson as they claim to be

In the first hypothetical scenario the voters know there's a possibility that the candidate might be hiding something; in the second hypothetical scenario the voters know for certain that the candidate is a lawbreaking, tax-cheating, lying hypocrite.

In which of these hypothetical scenarios do you think the voters are going to penalize the candidate more?

replies(1): >>gwrigh+FI1
◧◩◪◨⬒
18. gwrigh+FI1[view] [source] [discussion] 2019-11-03 17:04:28
>>pmoria+nX
This is a false choice and a different one than I suggested. You would have to consider this scenario also:

A released tax records; B released tax records; C didn't release any records.

FWIW, I've talked to an accountant about the idea of Trump revealing his tax records and the bottom line is that they would be sufficiently complicated that there is no possibility that the average person would be able to interpret them accurately, so you'll be left with the spin from the various media organizations, hardly a source of objective truth.

So I would assert that requiring candidates to release their tax records doesn't actually provide any useful information for a voter.

Remember that Trump's tax records are already examined by the IRS and, I believe, have been audited. So there shouldn't be any question of illegal activity being hidden, unless you want to assert that the IRS can't be trusted either.

There are also other concerns about tax records revealing information about 3rd parties. And finally tax records aren't really a useful way to understand the intricacies of a business. If you are really interested in that you would want the audit report for the underlying business and not just tax records.
