Algorithm can pick out almost any American in supposedly anonymized databases

>>zoobab+(OP)
I'm a programmer in the GP data analysis world. We use the term 'pseudonymization' for this kind of data. 'Anonymization' is used solely to refer to, say, 'the sum total of diabetes patients this practice has' (that would be anonymous patient data; it would not be anonymous relative to the GP office this refers to): Aggregated data that can no longer be reduced to a single individual at all.

The term raises questions: Okay, so, what does it mean? How 'pseudo' is psuedo? And that's the point: When you pseudonimize data, you must ask those questions and there is no black and white anymore.

My go-to example to explain this is very simple: Let's say we reduce birthdate info to just your birthyear, and geoloc info to just a wide area. And then I have an pseudonimized individual who is marked down as being 105 years old.

Usually there's only one such person.

I invite everybody who works in this field to start using the term 'pseudonimization'.

>>rzwits+e6
We covered this in a DB course with the term k-anonymity which seems to be standard in the literature, where a dataset is k-anonymous if every combination of characteristics (that can identify users) has at least K users. So in your case that dataset has only the 1-anonymity property, but you can set a k>1 and change the data set to satisfy it and improve the anonymity. Eg. if age was just stored as 90+ and there's at least 10 90+ year olds in each wide area then you'd get 10-anonymity.

I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

Also cool: this is how Have I been Pwned v2 works - if you send only the first 5 characters of a hash then it's guaranteed there's hundreds of matches and the server doesn't know the real password that had that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

>>mcinty+Gk
>I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

I think that for any size k less than the total size of the database, it is not anonymous. In cases like this, an overly strict definition favoring privacy is the only way to protect people. Similar to how we call 17 year olds children and treat them as such under law even though a 17 year old is far closer to an 18 year old than they are to a 5 year old (yes, there are some exceptions, but these are all explicitly called out). Another example of such an extreme is concerning falsifying data or making false statements. Even a single such statement, regardless of the number of true statements, destroys credibility once found when trust is extremely important. This is why even a single such statement can get one found in contempt of court or destroy a scientist's entire career (and even cast doubt on peers who were innocent).

Overall it is quite messy because it is a mix of a technical problem with a people problem.

>>SkyBel+Ju
> I think that for any size k less than the total size of the database, it is not anonymous.

Wouldn't that require that every field of every record in the database be globally unique?

If something as simple as gender is a field in the database, the best k you could get would be the lowest count of records of each existent gender option.

>>Shenda+7W
It would mean that any identifying data being included would result in the data set not being considered anonymous. It will mean we have to tell people the data being sent to other organizations and companies is not anonymous, which is what should happen anyways. No more hiding behind 'we took the bare minimum steps required, if more advanced statistical information de-anonymized the data it isn't our fault'.

zlacker