> The source code to reproduce the experiments is available at https://cpg.doc.ic.ac.uk/individual-risk, along with documentation, tests, and examples.
As far as I can tell, the source code is not available, at least not from where the authors suggest.
You are right that some differential privacy proofs have later been found to be wrong. For example, there is an entire paper about bugs in initial versions of the sparse vector technique [1].
However, I imagine this will evolve the way cryptographic security has evolved: at some point, enough experts have examined algorithm X to be confident about its differential privacy proof; then some experts implement it carefully; and the rest of us use their work because "rolling [our] own" is too tricky.
Or even more briefly: if you want to know how many people in your database have characteristic X, you can compute that number, add Laplace(1/epsilon) noise [2], and output the result. That's epsilon-differentially private. In general, if you're computing a statistic that has sensitivity s (one person can change the statistic by at most s), then adding Laplace(s/epsilon) noise to the statistic makes it epsilon-differentially private (see e.g. Theorem 3.6 here [3]). The intuition is that, by scaling the added noise to the sensitivity, you cover up the presence or absence of any one individual.
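To make that concrete, here is a minimal Python sketch of the Laplace mechanism (the function name and the example count are mine, purely for illustration):

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
        """Add Laplace(sensitivity/epsilon) noise: for a statistic with
        L1-sensitivity `sensitivity`, the output is epsilon-differentially private."""
        rng = rng or np.random.default_rng()
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    # Counting query: one person changes the count by at most 1, so sensitivity = 1.
    true_count = 4213  # hypothetical number of people with characteristic X
    print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))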
[1] https://github.com/frankmcsherry/blog/blob/master/posts/2016...
https://www.schneier.com/blog/archives/2019/07/google_releas...
https://www.microsoft.com/en-us/research/project/microsoft-s...
> Homogeneity Attack: This attack leverages the case where all the values for a sensitive value within a set of k records are identical. In such cases, even though the data has been k-anonymized, the sensitive value for the set of k records may be exactly predicted.
> Background Knowledge Attack: This attack leverages an association between one or more quasi-identifier attributes with the sensitive attribute to reduce the set of possible values for the sensitive attribute.
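A toy illustration of the homogeneity attack (made-up table, in the style of the l-diversity paper): the release is 3-anonymous, yet the sensitive value for one group is fully determined.

    # k-anonymized release (k = 3): quasi-identifiers are generalized, but every
    # record in the 476**/age 20-29 group shares the same sensitive value.
    records = [
        {"zip": "476**", "age": "20-29", "condition": "Heart disease"},
        {"zip": "476**", "age": "20-29", "condition": "Heart disease"},
        {"zip": "476**", "age": "20-29", "condition": "Heart disease"},
        {"zip": "479**", "age": "30-39", "condition": "Flu"},
        {"zip": "479**", "age": "30-39", "condition": "Cancer"},
        {"zip": "479**", "age": "30-39", "condition": "Heart disease"},
    ]

    # An attacker who only knows the target lives in 476** and is in their 20s
    # cannot single out a row, yet learns the diagnosis with certainty.
    group = {r["condition"] for r in records if r["zip"] == "476**" and r["age"] == "20-29"}
    print(group)  # {'Heart disease'} -> leaked despite k-anonymity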
Optimal k-anonymization is also computationally hard [2].
Given a source dataset, they create a synthetic dataset that has the same statistical properties (as defined at the point the synthetic dataset is created).
I've seen a demo; it's pretty slick: https://synthesized.io/
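I don't know what they do internally; as a crude illustration of the general idea only, here's a sketch that resamples each column from its empirical marginal. Per-column statistics are preserved, but cross-column correlations are not, which is exactly what the real tools have to work much harder to get right.

    import pandas as pd

    def synthesize_marginals(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
        """Naive synthetic data: resample each column independently from its
        empirical distribution. Column-wise statistics match; joint structure is lost."""
        return pd.DataFrame({
            col: df[col]
            .sample(n=n, replace=True, random_state=seed + i)
            .reset_index(drop=True)
            for i, col in enumerate(df.columns)
        })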
It's possible to learn something by aggregating a bunch of those individually-privatized statistics. Randomized response [1] is a canonical example. More generally, local differential privacy is a stronger privacy model where users privatize their own data before releasing it for (arbitrary) analysis. As you might expect, the stronger privacy guarantee means worse utility, sometimes much worse [2].
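A minimal sketch of the textbook two-coin version of randomized response (which is ln(3)-differentially private); the names and parameters here are mine:

    import random

    def randomized_response(truth: bool, rng=random) -> bool:
        """Flip a coin: heads, answer truthfully; tails, flip again and report
        that second coin. Every individual answer is plausibly deniable."""
        if rng.random() < 0.5:
            return truth
        return rng.random() < 0.5

    def estimate_rate(responses) -> float:
        """P(yes) = 0.25 + 0.5 * true_rate, so invert the bias to recover the rate."""
        observed = sum(responses) / len(responses)
        return 2 * (observed - 0.25)

    # Simulate 100k users, 30% of whom truly have the sensitive attribute.
    truths = [random.random() < 0.3 for _ in range(100_000)]
    print(estimate_rate([randomized_response(t) for t in truths]))  # ~0.30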
edit: shameless plug. check out tonic.ai for a solution to the above problem.
I recommend watching it if you're interested at https://homepages.cwi.nl/~boncz/sigmod-pods2019.html (top-left vid)
(As a side note, Frank McSherry received the SIGMOD Test of Time Award for his differential privacy paper at the same conference.)
I guess the interesting question then is how high k has to be before we can call it anonymous rather than merely pseudonymous.
Also cool: this is how Have I Been Pwned v2 works. If you send only the first 5 characters of the hash, there are guaranteed to be hundreds of matches, so the server never learns which real password had that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...
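A minimal client-side sketch against the public range endpoint described in Troy's post (error handling omitted):

    import hashlib
    import urllib.request

    def pwned_count(password: str) -> int:
        """Only the first 5 hex chars of the SHA-1 hash leave this machine; the
        server returns all suffixes sharing that prefix and we match locally."""
        sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = sha1[:5], sha1[5:]
        url = "https://api.pwnedpasswords.com/range/" + prefix
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8")
        for line in body.splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
        return 0

    print(pwned_count("password123"))  # nonzero -> seen in breaches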
> My go-to example to explain this is very simple: Let's say we reduce birthdate info to just your birthyear, and geoloc info to just a wide area. And then I have a pseudonymized individual who is marked down as being 105 years old.
> Usually there's only one such person.
I was interested to find that HIPAA's requirements for de-identification address the two particular issues you pointed out. First, ages at or above a threshold (90) must be bucketed together as "90 or older." Second, regarding ZIP codes: you must zero out the last two digits, and if the area covered by the resulting three-digit prefix contains 20,000 or fewer inhabitants according to the most recent US census, you have to blank the first three digits as well (there are currently 17 such three-digit prefixes). Source: pages 96-97 of the combined legislation, available at: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-reg...
You are allowed to roll your own de-identification method, as long as the person doing so is an expert on statistics and de-identification and they document their analysis of why their method is sound. To my knowledge, most entities use the "safe harbor" approach of wiping any data in the legislated blacklist of dimensions.
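A minimal sketch of those two Safe Harbor rules (the helper names are mine, and the restricted-prefix set below is a placeholder, not the official census-derived list):

    RESTRICTED_ZIP3 = {"036", "059"}  # placeholder for the ~17 prefixes covering 20,000 or fewer people

    def safe_harbor_age(age: int):
        """Ages 90 and over are collapsed into a single top-coded bucket."""
        return "90+" if age >= 90 else age

    def safe_harbor_zip(zip5: str) -> str:
        """Keep at most the 3-digit prefix; blank it entirely if the prefix
        is on the restricted (20,000 or fewer inhabitants) list."""
        prefix = zip5[:3]
        return "00000" if prefix in RESTRICTED_ZIP3 else prefix + "00"

    print(safe_harbor_age(105), safe_harbor_zip("10021"))  # -> 90+ 10000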
https://icml.cc/Conferences/2019/ScheduleMultitrack?event=43... (there is a video link on this page).