1. GhostV+(OP) 2019-07-24 11:46:48
> They shouldn't collect _anything_ personally identifiable

Why not just ensure that any personally identifiable data is properly bucketed, and discarded if it is too strongly identifying? If you are storing someone's height, age, and gender, you can just increase the bucket size for those fields until every combination of identifiable fields occurs several times in the dataset. If there are always a few different records with well-distributed values for every combination of identifiable fields, you can't infer anything about an individual based on which buckets they fall into.
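A toy sketch of that bucketing loop (the field names, step sizes, and discard policy here are invented for illustration, not from any real pipeline):

    from collections import Counter

    K = 5  # every quasi-identifier combination must occur at least K times

    def bucket(record, height_step, age_step):
        # Map a raw record to its coarsened bucket key.
        return (record["height_cm"] // height_step * height_step,
                record["age"] // age_step * age_step,
                record["gender"])

    def anonymize(records, k=K):
        # Try progressively coarser buckets until every occupied bucket
        # holds at least k records; give up otherwise.
        for height_step, age_step in [(5, 5), (10, 10), (20, 20)]:
            keys = [bucket(r, height_step, age_step) for r in records]
            if all(c >= k for c in Counter(keys).values()):
                return keys
        return None  # still too identifying: discard or suppress fields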

replies(1): >>majos+x1
2. majos+x1 2019-07-24 12:01:55
>>GhostV+(OP)
Not a bad idea! It sounds pretty similar to k-anonymity [1], which is not a terrible privacy heuristic, but it does have some specific weaknesses. Wikipedia describes them well:

> Homogeneity Attack: This attack leverages the case where all the values for a sensitive attribute within a set of k records are identical. In such cases, even though the data has been k-anonymized, the sensitive value for the set of k records may be exactly predicted.

> Background Knowledge Attack: This attack leverages an association between one or more quasi-identifier attributes with the sensitive attribute to reduce the set of possible values for the sensitive attribute.
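
The homogeneity attack in particular is easy to see with a toy example (the buckets and sensitive values below are invented):

    from collections import defaultdict

    def homogeneous_groups(rows):
        # rows: (bucket_key, sensitive_value) pairs from a k-anonymized
        # table. A group whose sensitive values are all identical leaks
        # that value for everyone known to fall in the bucket.
        groups = defaultdict(set)
        for key, sensitive in rows:
            groups[key].add(sensitive)
        return [k for k, vals in groups.items() if len(vals) == 1]

    rows = [(("30-39", "476**"), "heart disease"),
            (("30-39", "476**"), "heart disease"),
            (("30-39", "476**"), "heart disease"),  # homogeneous: leaked
            (("20-29", "130**"), "flu"),
            (("20-29", "130**"), "cancer"),
            (("20-29", "130**"), "flu")]
    print(homogeneous_groups(rows))  # [('30-39', '476**')]

Every bucket there holds 3 records, so the table is 3-anonymous, yet anyone known to be in the first bucket is fully exposed. Requiring several distinct sensitive values per bucket (l-diversity) is the usual patch for this.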

Optimal k-anonymization is also NP-hard [2].

[1] https://en.wikipedia.org/wiki/K-anonymity

[2] https://dl.acm.org/citation.cfm?id=1055591
