zlacker

[parent] [thread] 6 comments
1. mcinty+(OP)[view] [source] 2019-07-24 13:04:38
We covered this in a DB course under the term k-anonymity, which seems to be standard in the literature: a dataset is k-anonymous if every combination of characteristics (that can identify users) is shared by at least k users. So in your case that dataset only has 1-anonymity, but you can pick a k > 1 and change the data set to satisfy it and improve the anonymity. E.g. if age were stored only as "90+" and there are at least 10 people aged 90+ in each wide area, you'd get 10-anonymity.
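
Rough sketch of what the check looks like in code (field names are made up, and generalizing the data is the hard part - this just measures k for a chosen set of quasi-identifiers):

    from collections import Counter

    def k_anonymity(records, quasi_identifiers):
        # k = size of the smallest group of records sharing the same
        # combination of quasi-identifier values (0 for an empty dataset)
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(groups.values(), default=0)

    records = [
        {"age_bracket": "90+", "area": "north", "diagnosis": "flu"},
        {"age_bracket": "90+", "area": "north", "diagnosis": "asthma"},
        {"age_bracket": "90+", "area": "north", "diagnosis": "flu"},
    ]
    print(k_anonymity(records, ["age_bracket", "area"]))  # -> 3, i.e. 3-anonymous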

I guess the interesting question then is how high k has to be before you can call it anonymous rather than pseudonymous.

Also cool: this is how Have I Been Pwned v2 works - if you send only the first 5 characters of the password's hash, there are guaranteed to be hundreds of matches, so the server never learns the real password behind that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...
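
If I remember the v2 range API correctly, the client side is roughly this (only the 5-character SHA-1 prefix ever leaves your machine; the full comparison happens locally):

    import hashlib
    import urllib.request

    def pwned_count(password):
        digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        req = urllib.request.Request(
            "https://api.pwnedpasswords.com/range/" + prefix,
            headers={"User-Agent": "k-anonymity-demo"},
        )
        # Response is a list of "SUFFIX:COUNT" lines for every hash with that prefix
        with urllib.request.urlopen(req) as resp:
            for line in resp.read().decode("utf-8").splitlines():
                candidate, count = line.split(":")
                if candidate == suffix:
                    return int(count)
        return 0

    print(pwned_count("hunter2"))  # breach count, or 0 if it isn't in the set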

replies(3): >>SkyBel+3a >>tripzi+2p >>Commun+rs2
2. SkyBel+3a[view] [source] 2019-07-24 14:09:52
>>mcinty+(OP)
>I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

I think that for any k less than the total size of the database, it is not anonymous. In cases like this, an overly strict definition favoring privacy is the only way to protect people. It's similar to how we call 17-year-olds children and treat them as such under the law, even though a 17-year-old is far closer to an 18-year-old than to a 5-year-old (yes, there are some exceptions, but those are all explicitly called out). Another example of such an extreme concerns falsifying data or making false statements: where trust is extremely important, even a single false statement, regardless of the number of true ones, destroys credibility once it is found. This is why a single such statement can get one found in contempt of court or destroy a scientist's entire career (and even cast doubt on peers who were innocent).

Overall it's quite messy because it's a mix of a technical problem and a people problem.

replies(2): >>Shenda+rB >>dwohni+XJ
3. tripzi+2p[view] [source] 2019-07-24 15:43:22
>>mcinty+(OP)
Isn't part of that problem how you define a "characteristic" (and thus the combinations of them)? A dataset may have k-anonymity for a certain set of characteristics, but if you derive a new characteristic (by combining info, data mining, and perhaps other sources), the property doesn't necessarily hold any more.
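
Toy illustration (fabricated fields): the moment you treat a mined or linked attribute as identifying, the same min-group-size check gives a different answer:

    from collections import Counter

    def k_of(records, cols):
        # smallest equivalence class over the chosen columns
        return min(Counter(tuple(r[c] for c in cols) for r in records).values(), default=0)

    records = [
        {"age_bracket": "30-40", "zip3": "941", "bought_niche_product": True},
        {"age_bracket": "30-40", "zip3": "941", "bought_niche_product": False},
        {"age_bracket": "30-40", "zip3": "941", "bought_niche_product": False},
    ]
    print(k_of(records, ["age_bracket", "zip3"]))                          # 3
    print(k_of(records, ["age_bracket", "zip3", "bought_niche_product"]))  # 1
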
4. Shenda+rB[view] [source] [discussion] 2019-07-24 16:56:28
>>SkyBel+3a
> I think that for any size k less than the total size of the database, it is not anonymous.

Wouldn't that require that no field in the database distinguish any record from any other?

If something as simple as gender is a field in the database, the best k you could get would be the count of the rarest gender option present.

replies(1): >>SkyBel+TD
5. SkyBel+TD[view] [source] [discussion] 2019-07-24 17:10:11
>>Shenda+rB
It would mean that including any identifying data results in the data set not being considered anonymous. It would also mean we have to tell people that the data being sent to other organizations and companies is not anonymous, which is what should happen anyway. No more hiding behind 'we took the bare minimum steps required, so if more advanced statistical techniques de-anonymized the data it isn't our fault'.
6. dwohni+XJ[view] [source] [discussion] 2019-07-24 17:45:18
>>SkyBel+3a
Is there any attribute of a person for which this is true?

I think what you're asking for is that any piece of data stored about someone be extensionally equivalent to "this is a human being" and no more, which is not very useful (in an information-theoretic sense it carries exactly zero information).

7. Commun+rs2[view] [source] 2019-07-25 10:45:29
>>mcinty+(OP)
Part of the problem here too is that we may be dealing with more than one data set. If there is enough information overlap in those sets then you could map a record from one set to another. In that situation you'd have more data about each person, reducing the k-anonymity of the combined data sets to below the k-anonymity of either individual data set.
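
The classic linkage attack, roughly (toy, fabricated rows - each release might look harmless on its own):

    health = [  # released without names: quasi-identifiers plus a sensitive field
        {"zip3": "941", "birth_year": 1951, "gender": "F", "diagnosis": "diabetes"},
        {"zip3": "332", "birth_year": 1987, "gender": "M", "diagnosis": "asthma"},
    ]
    marketing = [  # a different dataset with names and the same quasi-identifiers
        {"name": "Alice Example", "zip3": "941", "birth_year": 1951, "gender": "F"},
        {"name": "Bob Example",   "zip3": "332", "birth_year": 1987, "gender": "M"},
    ]

    link = ("zip3", "birth_year", "gender")
    names = {tuple(m[c] for c in link): m["name"] for m in marketing}
    for h in health:
        match = names.get(tuple(h[c] for c in link))
        if match:
            print(match, "->", h["diagnosis"])  # re-identified record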

That concerns me most around places that process data for other companies (e.g., Cambridge Analytica, Facebook, Google, Amazon). These places could have access to many different data sets relating to a person, and could potentially combine them to uniquely identify a single individual.

I recently tried something where I entered a fake zip, birth date, and gender. Based on statistical probabilities it gave a 68% chance that the combination would have only 1-anonymity in a large data set. It wasn't clear what they considered large, so it could be bogus, but if true, imagine what could easily be done with 10+ fields (e.g., zip, birth date, gender, married?, # of children, ever smoked?, deductible amount, diabetes?, profession, BMI).

The earlier poster is right: only aggregate data is truly anonymous.
