Algorithm can pick out almost any American in supposedly anonymized databases

>>zoobab+(OP)
I'm a programmer in the GP data analysis world. We use the term 'pseudonymization' for this kind of data. 'Anonymization' is reserved for aggregated data that can no longer be reduced to a single individual at all, say, 'the total number of diabetes patients this practice has' (that is anonymous with respect to the patients; it is not anonymous with respect to the GP office it refers to).

The term raises questions: Okay, so, what does it mean? How 'pseudo' is pseudo? And that's the point: when you pseudonymize data, you must ask those questions, and nothing is black and white anymore.

My go-to example to explain this is very simple: Let's say we reduce birthdate info to just your birth year, and geoloc info to just a wide area. And then I have a pseudonymized individual who is marked down as being 105 years old.

Usually there's only one such person.

I invite everybody who works in this field to start using the term 'pseudonymization'.

>>rzwits+e6
We covered this in a DB course under the term k-anonymity, which seems to be standard in the literature: a dataset is k-anonymous if every combination of characteristics (that can identify users) is shared by at least k users. So in your case that dataset has only the 1-anonymity property, but you can pick a k>1 and change the data set to satisfy it and improve the anonymity. E.g., if age was just stored as 90+ and there are at least 10 people aged 90+ in each wide area, then you'd get 10-anonymity.
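A minimal sketch of that definition, assuming records are plain dicts and that `age` and `area` are the identifying characteristics. The sample data, the `generalize_age` helper, and the decade buckets are made up for illustration; with this toy data the generalization reaches k=3, not the 10 from the example above.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same identifying values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize_age(record, cap=90):
    """Hypothetical generalization: store ages as decade buckets, and as '90+' above the cap."""
    out = dict(record)
    if record["age"] >= cap:
        out["age"] = f"{cap}+"
    else:
        out["age"] = f"{record['age'] // 10 * 10}s"
    return out

# Made-up sample: one 105-year-old makes the raw data 1-anonymous.
records = [
    {"age": 105, "area": "north"},
    {"age": 93, "area": "north"},
    {"age": 91, "area": "north"},
    {"age": 34, "area": "north"},
    {"age": 37, "area": "north"},
    {"age": 31, "area": "north"},
]

k_raw = k_anonymity(records, ["age", "area"])
generalized = [generalize_age(r) for r in records]
k_generalized = k_anonymity(generalized, ["age", "area"])
print(k_raw, k_generalized)
```

The trade-off is visible in the data itself: the generalized set no longer distinguishes a 91-year-old from a 105-year-old, which is exactly what buys the larger k.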

I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

Also cool: this is how Have I Been Pwned v2 works. If you send only the first 5 characters of a hash, then it's guaranteed there are hundreds of matches, and the server doesn't know the real password that had that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...
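Roughly, the prefix trick looks like this. This is only a sketch: it assumes SHA-1 as the hash (the comment above doesn't say which one), and `fake_range_lookup` is a hypothetical local stand-in for the server, not the real API. No network call is made.

```python
import hashlib

PREFIX_LEN = 5  # only this many hex characters ever leave the client

def sha1_hex(password):
    """Uppercase hex SHA-1 digest of the password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()

def fake_range_lookup(prefix, breached_hashes):
    """Hypothetical server side: return the suffixes of every stored hash with this prefix."""
    return [h[PREFIX_LEN:] for h in breached_hashes if h.startswith(prefix)]

def is_pwned(password, breached_hashes):
    digest = sha1_hex(password)
    prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
    # The server only ever sees `prefix`; the final comparison happens locally.
    candidates = fake_range_lookup(prefix, breached_hashes)
    return suffix in candidates

# Made-up breach corpus for illustration.
breached = {sha1_hex(p) for p in ["password", "123456", "qwerty"]}

print(is_pwned("password", breached))
print(is_pwned("correct horse battery staple", breached))
```

With a real corpus, every 5-character prefix matches many stored hashes, so the prefix alone doesn't single out one password.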

>>mcinty+Gk
Part of the problem here too is that we may be dealing with more than one data set. If there is enough information overlap in those sets then you could map a record from one set to another. In that situation you'd have more data about each person, reducing the k-anonymity of the combined data sets to below the k-anonymity of either individual data set.
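A toy illustration of that effect, assuming the overlap takes the form of a shared pseudonymous `token` column that lets records be mapped one to one. The column names and rows are invented; real linkage usually has to work from fuzzier overlap than this.

```python
from collections import Counter

def k_anonymity(rows, columns):
    """Size of the smallest group of rows sharing the same values in `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(counts.values())

# Two made-up data sets about the same four people, linked by a pseudonymous token.
set_a = [
    {"token": "t1", "zip": "100", "birth_year": 1980},
    {"token": "t2", "zip": "100", "birth_year": 1980},
    {"token": "t3", "zip": "200", "birth_year": 1975},
    {"token": "t4", "zip": "200", "birth_year": 1975},
]
set_b = [
    {"token": "t1", "gender": "F", "profession": "nurse"},
    {"token": "t2", "gender": "M", "profession": "teacher"},
    {"token": "t3", "gender": "F", "profession": "nurse"},
    {"token": "t4", "gender": "M", "profession": "teacher"},
]

# Map each record in A to its counterpart in B and merge the attributes.
b_by_token = {row["token"]: row for row in set_b}
joined = [{**row, **b_by_token[row["token"]]} for row in set_a if row["token"] in b_by_token]

k_a = k_anonymity(set_a, ["zip", "birth_year"])
k_b = k_anonymity(set_b, ["gender", "profession"])
k_joined = k_anonymity(joined, ["zip", "birth_year", "gender", "profession"])
print(k_a, k_b, k_joined)
```

Each set on its own is 2-anonymous over its own columns, but once the attributes are merged every row is unique.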

That concerns me most around places that process data for other companies (e.g., Cambridge Analytica, Facebook, Google, Amazon). These places could have access to many different data sets relating to a person, and could potentially combine these data sets to uniquely identify a single individual.

I recently tried something where I entered a fake zip, birth date, and gender. Based on statistical probabilities, it gave a 68% chance of that combination having 1-anonymity in a large data set. It wasn't clear what they considered large, so it could be bogus, but if true, imagine what could easily be done with 10+ distinct fields (e.g., zip, birthdate, gender, married?, # of children, ever smoked?, deductible amount, diabetes?, profession, BMI).

The earlier poster is right: only aggregate data is truly anonymous.
