zlacker

[return to "Algorithm can pick out almost any American in supposedly anonymized databases"]
1. polski+t7[view] [source] 2019-07-24 11:13:49
>>zoobab+(OP)
Would differential privacy fix this problem? I heard that the new US census will use it.
2. majos+k8[view] [source] 2019-07-24 11:25:46
>>polski+t7
Yes, in the sense that the output of a differentially private protocol has mathematical guarantees against re-identification, regardless of the computational power or side information an adversary has.

There are caveats. The exact strength of the privacy guarantee depends on the parameters you use and the number of computations you do, so simply saying "we use a differentially private algorithm" doesn't guarantee privacy in isolation.

3. shusso+x9[view] [source] 2019-07-24 11:38:32
>>majos+k8
do you have some examples?
4. majos+ia[view] [source] 2019-07-24 11:43:34
>>shusso+x9
Of a differentially private algorithm? Frank McSherry (one of the authors of the original differential privacy paper) has a nice blog post introducing the idea and giving many examples with code [1].

Or even more briefly, if you want to know how many people in your database have characteristic X, you can compute that number and add Laplace(1/epsilon) noise [2] and output the result. That's epsilon-differentially private. In general, if you're computing a statistic that has sensitivity s (one person can change the statistic by at most s), then adding Laplace(s/epsilon) noise to the statistic makes it epsilon-differentially private (see e.g. Theorem 3.6 here [3]). The intuition is that, by scaling the added noise to the sensitivity, you cover up the presence or absence of any one individual.

[1] https://github.com/frankmcsherry/blog/blob/master/posts/2016...

[2] https://en.wikipedia.org/wiki/Laplace_distribution

[3] http://cis.upenn.edu/~aaroth/privacybook.html
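
To make the Laplace mechanism concrete, here's a minimal sketch in Python/NumPy (the function name and the example count are mine, just to illustrate the idea; this isn't code from the linked posts):

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon):
        # If one person can change the statistic by at most `sensitivity`,
        # adding Laplace(sensitivity/epsilon) noise makes the output
        # epsilon-differentially private.
        rng = np.random.default_rng()
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # "How many people in the database have characteristic X?"
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1.
    true_count = 1234  # hypothetical exact count
    print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
    # e.g. 1231.7 -- noisy, but still useful when the count is large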

5. shusso+3c[view] [source] 2019-07-24 12:00:27
>>majos+ia
Thanks for the links. I'm still a little confused about non-aggregated fields, though: can differentially private algorithms also be applied to mask/anonymise individual (non-aggregated) fields?
6. majos+gd[view] [source] 2019-07-24 12:11:11
>>shusso+3c
You could, but if your statistic is a function of one person's data, differential privacy will force you to add enough noise to mask that one person's data, i.e. destroy almost all of the utility of the statistic.

It's possible to learn something by aggregating a bunch of those individually-privatized statistics. Randomized response [1] is a canonical example. More generally, local differential privacy is a stronger privacy model where users privatize their own data before releasing it for (arbitrary) analysis. As you might expect, the stronger privacy guarantee means worse utility, sometimes much worse [2].

[1] https://en.wikipedia.org/wiki/Randomized_response
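
Here's a minimal sketch of the coin-flip version of randomized response (the simulation and names are mine, just to show that individually-privatized reports can still be aggregated into a useful estimate):

    import numpy as np

    rng = np.random.default_rng(0)

    def randomize(true_answer):
        # Each respondent privatizes their own yes/no answer locally:
        # with probability 1/2 answer truthfully, otherwise answer
        # uniformly at random. This is (ln 3)-differentially private
        # for that respondent.
        if rng.random() < 0.5:
            return true_answer
        return rng.random() < 0.5

    def estimate_proportion(reports):
        # Debias the aggregate: P(report "yes") = 0.5*p + 0.25,
        # so the true proportion p is about 2*q - 0.5.
        q = np.mean(reports)
        return 2 * q - 0.5

    # 100,000 respondents, 30% of whom truly have the sensitive attribute.
    true_answers = rng.random(100_000) < 0.3
    reports = [randomize(a) for a in true_answers]
    print(estimate_proportion(reports))  # close to 0.3, but noisier than a direct count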

[go to top]