zlacker

Algorithm can pick out almost any American in supposedly anonymized databases

submitted by zoobab+(OP) on 2019-07-24 09:29:49 | 261 points 90 comments
[view article] [source]

NOTE: showing posts with links only
2. shusso+p4[view] [source] 2019-07-24 10:33:21
>>zoobab+(OP)
The paper referenced in the article: https://www.nature.com/articles/s41467-019-10933-3
16. shakna+l7[view] [source] [discussion] 2019-07-24 11:12:33
>>ptah+i6
The paper suggests it can be accessed on the site [0]; however, certain parts of the site only appear if you run through their questionnaires.

> The source code to reproduce the experiments is available at https://cpg.doc.ic.ac.uk/individual-risk, along with documentation, tests, and examples.

As far as I can tell, the source code is not available, at least not from where the authors suggest.

[0] https://cpg.doc.ic.ac.uk/individual-risk/

25. majos+M9[view] [source] [discussion] 2019-07-24 11:40:12
>>rectan+l4
> Unfortunately, Differential Privacy proofs can be used to justify applications which turn out to leak privacy when the proofs are shown to be incorrect after the fact, when the data is already out there and the damage already done.

You are right that some differential privacy proofs have later been found to be wrong. For example, there is an entire paper about bugs in initial versions of the sparse vector technique [1].

However, I imagine this will evolve the way cryptographic security has evolved: at some point, enough experts will have examined algorithm X to be confident in its differential privacy proof; then some experts will implement it carefully; and the rest of us will use their work because "rolling [our] own" is too tricky.

[1] https://arxiv.org/abs/1603.01699

27. majos+ia[view] [source] [discussion] 2019-07-24 11:43:34
>>shusso+x9
Of a differentially private algorithm? Frank McSherry (one of the authors of the original differential privacy paper) has a nice blog post introducing the idea and giving many examples with code [1].

Or even more briefly, if you want to know how many people in your database have characteristic X, you can compute that number and add Laplace(1/epsilon) noise [2] and output the result. That's epsilon-differentially private. In general, if you're computing a statistic that has sensitivity s (one person can change the statistic by at most s), then adding Laplace(s/epsilon) noise to the statistic makes it epsilon-differentially private (see e.g. Theorem 3.6 here [3]). The intuition is that, by scaling the added noise to the sensitivity, you cover up the presence or absence of any one individual.

[1] https://github.com/frankmcsherry/blog/blob/master/posts/2016...

[2] https://en.wikipedia.org/wiki/Laplace_distribution

[3] http://cis.upenn.edu/~aaroth/privacybook.html
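
To make the mechanism above concrete, here's a minimal Python sketch of the Laplace mechanism (an illustration of the definition above, not code taken from any of the linked references):

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
        # Add Laplace(sensitivity/epsilon) noise to a statistic.
        # For a counting query ("how many people have characteristic X?")
        # the sensitivity is 1: adding or removing one person changes
        # the count by at most 1.
        rng = rng or np.random.default_rng()
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Example: release a count of 1234 with epsilon = 0.1
    noisy_count = laplace_mechanism(1234, sensitivity=1, epsilon=0.1)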

35. mtgx+Wb[view] [source] [discussion] 2019-07-24 11:59:47
>>polski+t7
Homomorphic encryption would be better:

https://www.schneier.com/blog/archives/2019/07/google_releas...

https://www.microsoft.com/en-us/research/project/microsoft-s...

38. majos+ac[view] [source] [discussion] 2019-07-24 12:01:55
>>GhostV+Da
Not a bad idea! It sounds pretty similar to k-anonymity [1], which is not a terrible privacy heuristic. But it does have some specific weaknesses. Wikipedia has a good description.

> Homogeneity Attack: This attack leverages the case where all the values for a sensitive value within a set of k records are identical. In such cases, even though the data has been k-anonymized, the sensitive value for the set of k records may be exactly predicted.

> Background Knowledge Attack: This attack leverages an association between one or more quasi-identifier attributes with the sensitive attribute to reduce the set of possible values for the sensitive attribute.

Optimal k-anonymization is also computationally hard [2].

[1] https://en.wikipedia.org/wiki/K-anonymity

[2] https://dl.acm.org/citation.cfm?id=1055591
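
To illustrate the homogeneity attack in a few lines of Python (the table and quasi-identifier values here are made up for the example):

    # A 3-anonymous table: three records share the same generalized
    # quasi-identifiers (age bucket, ZIP prefix).
    records = [
        {"age": "20-30", "zip": "130**", "disease": "heart disease"},
        {"age": "20-30", "zip": "130**", "disease": "heart disease"},
        {"age": "20-30", "zip": "130**", "disease": "heart disease"},
    ]

    # An attacker who knows their neighbour is 20-30 and lives in 130**
    # can't tell which row is theirs (3-anonymity holds), but every
    # candidate row has the same sensitive value, so the diagnosis
    # leaks anyway.
    candidates = {r["disease"] for r in records
                  if r["age"] == "20-30" and r["zip"] == "130**"}
    assert candidates == {"heart disease"}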

41. richma+Pc[view] [source] 2019-07-24 12:08:17
>>zoobab+(OP)
There's a startup in London called Synthesized working on part of this problem space.

Given a source dataset they create a synthetic dataset that has the same statistical properties (as defined at the point the synthetic dataset is created).

I've seen a demo; it's pretty slick: https://synthesized.io/

44. majos+gd[view] [source] [discussion] 2019-07-24 12:11:11
>>shusso+3c
You could, but if your statistic is a function of one person's data, differential privacy will force you to add enough noise to mask that one person's data, i.e. destroy almost all of the utility of the statistic.

It's possible to learn something by aggregating a bunch of those individually-privatized statistics. Randomized response [1] is a canonical example. More generally, local differential privacy is a stronger privacy model where users privatize their own data before releasing it for (arbitrary) analysis. As you might expect, the stronger privacy guarantee means worse utility, sometimes much worse [2].

[1] https://en.wikipedia.org/wiki/Randomized_response
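
For anyone curious, here's a minimal Python sketch of classic coin-flip randomized response (an illustration, not any particular library's implementation):

    import numpy as np

    def randomized_response(true_answer: bool, rng=None) -> bool:
        # Flip a coin: heads, answer truthfully; tails, answer with a
        # second coin flip. Each individual report is plausibly deniable.
        rng = rng or np.random.default_rng()
        if rng.random() < 0.5:
            return true_answer
        return rng.random() < 0.5

    def estimate_true_rate(reports) -> float:
        # The population rate is still recoverable because
        # P(report yes) = 0.25 + 0.5 * P(true yes).
        return (np.mean(reports) - 0.25) / 0.5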

45. akamor+Ld[view] [source] 2019-07-24 12:15:59
>>zoobab+(OP)
There are also way too many people with access to non-anonymized data, e.g. the development team that has read privileges on the production database, or employees at Uber spying on customers (https://www.theguardian.com/technology/2016/dec/13/uber-empl...).

edit: shameless plug. check out tonic.ai for a solution to the above problem.

52. polski+Lf[view] [source] [discussion] 2019-07-24 12:30:20
>>shusso+x9
There was a keynote by Cynthia Dwork at the recent PODS/SIGMOD about differential privacy, with the Census as an example.

If you're interested, I recommend watching it at https://homepages.cwi.nl/~boncz/sigmod-pods2019.html (top-left vid).

(As a side note, Frank McSherry received the SIGMOD Test of Time Award for his differential privacy paper at the same conference.)

60. mcinty+Gk[view] [source] [discussion] 2019-07-24 13:04:38
>>rzwits+e6
We covered this in a DB course under the term k-anonymity, which seems to be standard in the literature: a dataset is k-anonymous if every combination of characteristics (that can identify users) is shared by at least k users. So in your case the dataset has only the 1-anonymity property, but you can pick a k > 1 and change the dataset to satisfy it and improve the anonymity. E.g. if age were just stored as 90+ and there were at least ten 90+ year olds in each wide area, then you'd get 10-anonymity.

I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.
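
A quick Python sketch of that definition (the field names and rows are made up):

    from collections import Counter

    def anonymity_level(rows, quasi_identifiers):
        # The k for which the dataset is k-anonymous: the size of the
        # smallest group sharing one combination of quasi-identifiers.
        groups = Counter(tuple(row[q] for q in quasi_identifiers)
                         for row in rows)
        return min(groups.values())

    rows = [
        {"age": "90+", "area": "north"},
        {"age": "90+", "area": "north"},
        {"age": "40-50", "area": "north"},
        {"age": "40-50", "area": "north"},
    ]
    print(anonymity_level(rows, ["age", "area"]))  # -> 2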

Also cool: this is how Have I Been Pwned v2 works. If you send only the first 5 characters of the password hash, it's guaranteed there are hundreds of matches, so the server never learns the real password behind that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...
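
A rough Python sketch of that lookup, assuming the suffix:count response format described in Troy Hunt's post (error handling omitted):

    import hashlib
    import requests

    def pwned_count(password: str) -> int:
        # Only the first 5 hex chars of the SHA-1 hash leave the client;
        # the server returns every suffix sharing that prefix.
        sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = sha1[:5], sha1[5:]
        resp = requests.get("https://api.pwnedpasswords.com/range/" + prefix)
        resp.raise_for_status()
        for line in resp.text.splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
        return 0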

67. lmkg+To[view] [source] [discussion] 2019-07-24 13:32:53
>>rzwits+e6

> My go-to example to explain this is very simple: Let's say we reduce birthdate info to just your birthyear, and geoloc info to just a wide area. And then I have a pseudonymized individual who is marked down as being 105 years old.

> Usually there's only one such person.

I was interested to find that HIPAA's requirements for de-identification address the two particular issues you pointed out. First, ages above a threshold (90) must be bucketed together as "older than 90." Second, regarding ZIP codes: you must zero out the last two digits, and then, if the resulting three-digit area contains fewer than 20,000 inhabitants according to the most recent US census, you have to blank the first three digits as well (there are currently 17 such three-digit prefixes).

Source: Pages 96-97 of the combined legislation, available at: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-reg...

You are allowed to roll your own de-identification method, as long as the person doing so is an expert on statistics and de-identification and they document their analysis of why their method is sound. To my knowledge, most entities use the "safe harbor" approach of wiping any data in the legislated blacklist of dimensions.
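
For illustration, a minimal Python sketch of the safe-harbor-style age and ZIP generalizations described above (the sparse-prefix set is a placeholder, not the real census-derived list):

    SPARSE_ZIP3_PREFIXES = {"036", "059", "102"}  # placeholder values

    def generalize_age(age: int) -> str:
        # Ages at or above the threshold are bucketed together.
        return "90+" if age >= 90 else str(age)

    def generalize_zip(zip_code: str) -> str:
        # Keep only the first three digits, unless that three-digit
        # area is too sparsely populated, in which case blank it all.
        prefix = zip_code[:3]
        return "00000" if prefix in SPARSE_ZIP3_PREFIXES else prefix + "00"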

76. cschmi+BC[view] [source] [discussion] 2019-07-24 15:00:03
>>polski+t7
At ICML 2019 there was a good keynote by the chief scientist of the Census Bureau. The 2010 census predated a good understanding of differential privacy, so the anonymization wasn't really done correctly. The Census Bureau actually went back, bought a bunch of commercially available data, and deanonymized ~70% of the individuals in the US (if I remember correctly). So they wanted to do better this time, and they seem to be taking the state of the art into account. It was much more impressive than I was expecting, actually.

https://icml.cc/Conferences/2019/ScheduleMultitrack?event=43... there is a video link on this page

79. Cynddl+SK[view] [source] [discussion] 2019-07-24 15:49:27
>>mnky98+ms
The article is available here, in open access: https://www.nature.com/articles/s41467-019-10933-3