Algorithm can pick out almost any American in supposedly anonymized databases

>>zoobab+(OP)
If people tell you they're collecting data for statistical purposes, then one of three things should hold:

1. They should deliberately introduce noise into the raw data. Nazis with the raw census data can spend all month trying to find the two 40-something Jews that the data says live on this island of 8400 people, but they were just noise. Or were they? No way to know.

2. They should bucket everything and discard all raw data immediately. This hampers future analysis, so the buckets must be chosen carefully, but it is often enough for real statistical work, and often you could just collect the data again later if you realise you needed different buckets.

3. They shouldn't collect _anything_ personally identifiable. This is hard because it could be almost anything at all. If you're 180cm tall, your height doesn't seem personally identifiable, but ask Sun Mingming. If you own a Honda Civic, then the model of your car doesn't seem personally identifiable, but ask somebody in a Rolls Royce Wraith Luminary...
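
For what it's worth, points 1 and 2 can be sketched in a few lines. This is only a rough illustration, not a vetted anonymization scheme: it uses randomized response as one possible way to put noise into each raw answer, plus fixed-width age buckets. The function names, the truth probability and the bucket width are all made up for the example.

```python
import random
from collections import Counter


def randomized_response(truth, p_truth, rng):
    """Point 1: report the true yes/no answer with probability p_truth,
    otherwise report a fair coin flip. Any single stored answer may be noise."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth):
    """The noise can still be undone in aggregate, because
    observed_rate = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


def bucket_age(age, width=10):
    """Point 2: keep only the bucket label, e.g. 43 -> '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def bucketed_counts(ages, width=10):
    """Aggregate to per-bucket counts; the raw ages are meant to be discarded."""
    return Counter(bucket_age(a, width) for a in ages)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    truths = [i < 1000 for i in range(10000)]  # hypothetical population, 10% "yes"
    reports = [randomized_response(t, 0.5, rng) for t in truths]
    print("estimated rate:", estimate_true_rate(reports, 0.5))
    print(bucketed_counts([41, 47, 52, 38]))
```

The aggregate estimate stays usable while no individual stored answer can be trusted, which is the "Or were they?" property. The cost is extra variance, so small populations give poor estimates.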

>>tialar+x4
To points 1 and 2: It has proven very difficult to sanitize a dataset in a way that ensures anonymity without rendering it useless. You aren't the first to think of these kinds of transformations.

There are problems with Point 3: we're continually surprised by how effectively smart people can identify individuals in datasets expected to be 'safe'. You've also not accounted for the fact that a collection of individually non-identifying attributes may become identifying in combination.
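
A quick way to see the combination problem is to count how many records share each combination of attributes. The sketch below does that on made-up toy records; the column names and helper names are hypothetical, and the smallest group size is the "k" in the usual k-anonymity sense.

```python
from collections import Counter


def group_sizes(records, columns):
    """Count how many records share each combination of values in `columns`."""
    return Counter(tuple(r[c] for c in columns) for r in records)


def smallest_group(records, columns):
    """Size of the rarest combination; 1 means somebody is unique."""
    return min(group_sizes(records, columns).values())


def unique_fraction(records, columns):
    """Fraction of records whose combination appears exactly once."""
    sizes = group_sizes(records, columns)
    unique = sum(1 for r in records if sizes[tuple(r[c] for c in columns)] == 1)
    return unique / len(records)


# Made-up toy records, not real data.
records = [
    {"zip": "02139", "birth_year": 1975, "sex": "F"},
    {"zip": "02139", "birth_year": 1975, "sex": "M"},
    {"zip": "02139", "birth_year": 1980, "sex": "F"},
    {"zip": "02140", "birth_year": 1980, "sex": "M"},
    {"zip": "02140", "birth_year": 1975, "sex": "F"},
]

if __name__ == "__main__":
    for cols in (["zip"], ["birth_year"], ["sex"], ["zip", "birth_year", "sex"]):
        print(cols, smallest_group(records, cols), unique_fraction(records, cols))
```

In this toy set every single column hides each person among at least one other record, yet the three columns together single out every record.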

That said, the GDPR is largely about prohibiting unnecessary data collection, in the spirit of Point 3. Hopefully it'll help at least a little.
