Algorithm can pick out almost any American in supposedly anonymized databases

>>zoobab+(OP)
The great contribution of Differential Privacy theory is to quantify just how little use you can get out of aggregate data before individuals become identifiable.

Unfortunately, Differential Privacy proofs can be used to justify applications which turn out to leak privacy when the proofs are shown to be incorrect after the fact, when the data is already out there and the damage already done.

Nevertheless, it is instructive just to see how perilously few queries can be answered before compromise occurs — putting the lie to the irresponsible idea of "anonymization".

replies(2): >>majos+M9 >>specia+wl

>>zoobab+(OP)
The paper referenced in the article: https://www.nature.com/articles/s41467-019-10933-3

>>zoobab+(OP)
If people tell you they're collecting data for statistical purposes, then one of three things:

1. They should deliberately introduce noise into the raw data. Nazis with the raw census data can spend all month trying to find the two 40-something Jews that data says live on this island of 8400 people, but they were just noise. Or were they? No way to know.

2. Bucket everything and discard all raw data immediately. This hampers future analysis, so the buckets must be chosen carefully, but it is often enough for real statistical work, and often you could just collect data again later if you realise you needed different buckets.

3. They shouldn't collect _anything_ personally identifiable. Hard because this could be almost anything at all. If you're 180cm tall your height doesn't seem personally identifiable, but ask Sun Mingming. If you own a Honda Civic then model of car doesn't seem personally identifiable but ask somebody in a Rolls Royce Wraith Luminary...

replies(3): >>pas+N5 >>GhostV+Da >>MaxBar+vx1

>>zoobab+(OP)
The article title misses a bit of nuance from the paper which is specifically talking about re-identification.

e.g from the paper:

"We show that, as a male born on July 31, 1945 and living in Cambridge (02138), the information used by Latanya Sweeney at the time, William Weld was unique with a 58% likelihood (ξx = 0.58 and κx = 0.77), meaning that Latanya Sweeney’s re-identification had 77% chances of being correct. We show that, if his medical records had included number of children—5 for William Weld—, her re-identification would have had 99.8% chances of being correct!"

replies(1): >>Freak_+87

>>zoobab+(OP)
We are going to have to get used to and start to develop a legal and moral framework around the fact that it will soon be impossible to stay in the shadows.

>>tialar+x4
Nazis used the census because it was there. Were it not, they would have just went house to house - which they then regularly did anyway.

The antidote of oppression is not the Index statisticum prohibitorum, but quite the opposite, education, and in particular educating about how different each and every one of us is, and yet it doesn't take much to get along well.

replies(1): >>pjc50+j6

>>zoobab+(OP)
I'm a programmer in the GP data analysis world. We use the term 'pseudonymization' for this kind of data. 'Anonymization' is used solely to refer to, say, 'the sum total of diabetes patients this practice has' (that would be anonymous patient data; it would not be anonymous relative to the GP office this refers to): Aggregated data that can no longer be reduced to a single individual at all.

The term raises questions: Okay, so, what does it mean? How 'pseudo' is psuedo? And that's the point: When you pseudonimize data, you must ask those questions and there is no black and white anymore.

My go-to example to explain this is very simple: Let's say we reduce birthdate info to just your birthyear, and geoloc info to just a wide area. And then I have an pseudonimized individual who is marked down as being 105 years old.

Usually there's only one such person.

I invite everybody who works in this field to start using the term 'pseudonimization'.

replies(11): >>cheez+A6 >>philba+C6 >>mgkims+V6 >>TadaSc+68 >>abcc8+19 >>martin+7f >>dalbas+ai >>mcinty+Gk >>jklehm+um >>lmkg+To >>heaven+eF

>>zoobab+(OP)
where's the link to code?

replies(1): >>shakna+l7

>>pas+N5
Today's Nazis would probably use machine learning to identify "characteristically Jewish facial features".

(Or, of course, simply pass a law that entitles them to deport or hold indefinitely anyone who can't prove they're a real Ayrian with only the papers they have on them at the time)

replies(2): >>wayout+5b >>tal8d+Zc

>>zoobab+(OP)
In any database whose fields have at least 4 meaningful ranges, 16 fields give 4.000.000.000 possibilities. Now I am on my iphone. As long as ranges are meaningful (i.e. they do divide the group in somewhat even parts), the individuating possibilities are HUGE. And fields do have usually many more than 4 ranges.

Anonymizing datasets is a weasel term.

The database is secure or it is not. As any database is quite likely insecure, we are doomed.

replies(1): >>jobigo+4w

>>rzwits+e6
It doesn't roll off the tongue, perhaps pseudo-anonymization is enough.

replies(4): >>gervu+B8 >>Berisl+aa >>shardi+Pb >>Zenst+Ub

>>rzwits+e6
When I worked in the data analysis world, we would have a bounds for this stuff.

Number of people >100 in the area: <10

>>zoobab+(OP)
When information you are recieving has been "anonymized", how do you tell it is accurate? If you are the person collecting and storing said sensative information, you will know what is and is not accurate, but nobody else knows.

For some kinds of information, like medical records, the information is deadly not to have accurate, but also deadly to have accurate and public. Once the information leaks, employers might decide to not hire high-risk people or insurers might decide to pass over certain people as too costly.

I'm of the opinion "anonymizing" data is something that enables grifting; if enables the collectors to placate the people they are pulling data from, and it allows the grifters to make the argument the information they have means nothing.

Ultimately, I think these organizations should be making sure their information is absolutely accurate, and we should have laws in place, with severe criminal penalties, against the use transfer or use of said information. I would even go so far as to say things like cell phone location records should be fully public as a matter of the law.

Now when you want to get those records, you go to a government website for the "hunt and poke" stuff (e.g. where are my kids going or is my wife spending time with another lover, how long is my commute on average, or where was I at 3 years ago on a day, all sorts of useful questions); the access records are public too.

If you want to study them, you sign a NDA saying you won't, under penalty of severe criminal prosecution, leak the information or use it for criminal purposes. Anyone found having the data and no signed government NDA = instant 20 year prison sentance plus felony conviction.

This way, if, for example, someone signs the NDA and goes on to offer services to executives to help them cherry pick staff, not only does the person offering the under the table services go to jail, but the executive does as well.

When you criminalize certain things, then give the public all the information and tools to do as they see fit, the law works. It's a lot easier to prosecute a company executive for cherry-picking staff with insurance data when the data is well-labeled. It is also a lot easier to sue them when you have an access record that says someone under their employ checked how often you go to a clinic or night club via your cellphone records.

The problem is not going away anyway, and "anonymizing" data to placate our sense of morality isn't going to help. There is no easy technical solution, but if the thinking is not to anonymize but instead track and enforce who has access, things change drastically.

replies(1): >>raxxor+5c

>>rzwits+e6
I worked on a reporting/dataviz project a few years back, showing survey results from states/districts/schools. All the data was included in the totals, but if a school had 5 or fewer responses, we didn't allow viewing of the school data. It was specifically because being able to see answers to things like "do you support your principal?" when there were only, say, 3 teachers at a school, and seeing 2 for "no"... it was way too easy to determine who the respondents were. Even with '5' as a cutoff, it still felt a bit dicey for some schools.

>>shusso+O4
What accounts for that remaining 2‰ of uncertainty?

replies(3): >>air7+la >>bsanr2+2e >>Cynddl+ke

>>ptah+i6
The paper suggests it can be accessed on the site [0], however certain parts of the site only appear if you run through their questionnaires.

> The source code to reproduce the experiments is available at https://cpg.doc.ic.ac.uk/individual-risk, along with documentation, tests, and examples.

As far as I can tell, the source code is not available, at least not from where the authors suggest.

[0] https://cpg.doc.ic.ac.uk/individual-risk/

replies(1): >>Cynddl+Vd

>>zoobab+(OP)
Would differential privacy fix this problem? I heard that new US census will use it.

replies(3): >>majos+k8 >>mtgx+Wb >>cschmi+BC

>>rzwits+e6
Pseudo and psuedo are derived from the greek work ψεμα psema, which literally means lie. It's synonym is 'untrue' which indicates that your term is the right term to use when anonymizing-but-not-really-anonymizing

replies(1): >>contra+Z8

>>polski+t7
Yes, in the sense that the output of a differentially private protocol has mathematical guarantees against re-identification, regardless of the computational power or side information an adversary has.

There are caveats. The exact strength of the privacy guarantee depends on the parameters you use and the number of computations you do, so simply saying "we use a differentially private algorithm" doesn't guarantee privacy in isolation.

replies(1): >>shusso+x9

>>cheez+A6
Pseudonymization already refers to reference by pseudonym.

Pseudonimization is bad terminology in that it's indistinct from the above, to the point that parent has already mixed the two up while in the process of recommending it. And it'd be worse verbally.

"Pseudo-anonymization" could work, but something like "breakable anonymization" or "partial anonymization" might be better in that it's more obvious to a reader and doesn't rely on familiarity with technical terminology to convey the idea.

I'd go with breakable, myself, since it's most to the point about why it's a problem.

Pseudo is etymologically correct, but that doesn't necessarily help us much when the goal is ratio and ease of understanding by a wide population of readers.

Partial could work in the sense that you did part of the job, which people would hopefully understand is a bit like having locked the back door for the night while leaving the front propped wide open.

And there are probably other good options. If I was writing about this topic often, I'd strongly consider brainstorming a few more and running a user test where I ask random people to explain each term, then go with what consistently gets results closest to what I'm trying to discuss.

replies(3): >>nobrai+G9 >>washad+Tb >>hiccup+dk

>>TadaSc+68
In this case it might not be 'pseudo'-anonymization but rather 'pseudonym'-ization. At least that seemed to be the intent when I last heard the term.

>>rzwits+e6
We refer to such data as deidentified.

>>majos+k8
do you have some examples?

replies(2): >>majos+ia >>polski+Lf

>>gervu+B8
Its not partial. Its pseudo.

>>rectan+l4
> Unfortunately, Differential Privacy proofs can be used to justify applications which turn out to leak privacy when the proofs are shown to be incorrect after the fact, when the data is already out there and the damage already done.

You are right that some differential privacy proofs have later been found to be wrong. For example, there is an entire paper about bugs in initial versions of the sparse vector technique [1].

However, I imagine this will evolve the way cryptographic security has evolved: at some point, enough experts have examined algorithm X to be confident about its differential privacy proof; then some experts implement it carefully; and the rest of us use their work because "rolling [our] own" is too tricky.

[1] https://arxiv.org/abs/1603.01699

>>cheez+A6
I would suggest "denonymization", implying that the "names" (i.e. identifiable information) were removed. As opposed to anonymous data, which presumably never had any to start with, like the example mentioned by the GP.

replies(2): >>leetcr+7n >>jobigo+eu

>>shusso+x9
Of a differentially private algorithm? Frank McSherry (one of the authors of the original differential privacy paper) has a nice blog post introducing the idea and giving many examples with code [1].

Or even more briefly, if you want to know how many people in your database have characteristic X, you can compute that number and add Laplace(1/epsilon) noise [2] and output the result. That's epsilon-differentially private. In general, if you're computing a statistic that has sensitivity s (one person can change the statistic by at most s), then adding Laplace(s/epsilon) noise to the statistic makes it epsilon-differentially private (see e.g. Theorem 3.6 here [3]). The intuition is that, by scaling the added noise to the sensitivity, you cover up the presence or absence of any one individual.

[1] https://github.com/frankmcsherry/blog/blob/master/posts/2016...

[2] https://en.wikipedia.org/wiki/Laplace_distribution

[3] http://cis.upenn.edu/~aaroth/privacybook.html

replies(1): >>shusso+3c

>>Freak_+87
Good question. I'd also assume that re-identification chances would always be of the form 100/k (k being the integer number of people who fit the bill)

>>tialar+x4
> They shouldn't collect _anything_ personally identifiable

Why not just ensure that any personally identifiable data is properly bucketed, and discarded if it is too strongly identifiable. If you are storing someone's height, age, and gender, you can just increase the bucket size for those fields until every combination of identifiable fields occurs several times in the dataset. If there are always a few different records with well distributed values for every combination of identifiable fields, you can't infer anything about an individual based on which buckets they fall into.

replies(1): >>majos+ac

>>pjc50+j6
Probably the latter, considering any sort of genetic or scientific testing that was reasonably accurate at identifying people with Jewish ancestry would also implicate much of German high command (including Hitler).

Remember; fascists don't believe in things because they are true, but because they are a means to an end. Their ultimate goal is authoritarian control, and an administrative mechanism is far more effective toward that goal than a scientific one.

>>zoobab+(OP)
If they can pick out individuals from the data, then the data is not anonymized. Sure they may have unassociated data spread across unassociated records, but if an algorithm can pick it out, then so could a human (though way more effort). That for me is not anonymized data.

replies(1): >>Alexan+Ic

>>cheez+A6
Just joking, but how about 'nonymisation' or 'non-nonimysation'

>>gervu+B8
'Pseudononymization' is how my brain read it. It's a fairly self-explanatory portmanteau with little chance of confusion on the root.

replies(1): >>gervu+kc

>>cheez+A6
True, maybe call it "redacted with a highlighting pen" as appears to get the same results.

>>polski+t7
Homomorphic encryption would be better:

https://www.schneier.com/blog/archives/2019/07/google_releas...

https://www.microsoft.com/en-us/research/project/microsoft-s...

>>majos+ia
Thanks for the links. I'm still a little confused by how differential privacy can be applied to non-aggregated fields. Can differentially private algorithms also be applied to mask/anonymise non-aggregated fields?

replies(1): >>majos+gd

>>TheBob+T6
You should be able to request any personal data to be deleted. If the company in question leaks anything, they are responsible. Here the bonkers american style punishments might actually be the way to go.

That aside, I would like the option that says "do not collect the data". It wouldn't even be hard.

Sure there is knowledge and advantages in that data, but that doesn't even come close to the benefits of privacy. Think the general public opinion about X is pretty stupid? If so, you'll need it too.

replies(1): >>TheBob+Uc

>>GhostV+Da
Not a bad idea! It sounds pretty similar to k-anonymity [1], which is not a terrible privacy heuristic. But it does have some specific weaknesses. Wikipedia has a good description.

> Homogeneity Attack: This attack leverages the case where all the values for a sensitive value within a set of k records are identical. In such cases, even though the data has been k-anonymized, the sensitive value for the set of k records may be exactly predicted.

> Background Knowledge Attack: This attack leverages an association between one or more quasi-identifier attributes with the sensitive attribute to reduce the set of possible values for the sensitive attribute.

Optimal k-anonymization is also computationally hard [2].

[1] https://en.wikipedia.org/wiki/K-anonymity

[2] https://dl.acm.org/citation.cfm?id=1055591

>>washad+Tb
"Pseudonymization but with an extra syllable" sounds just as confusing as the other two. I wouldn't be sure which of those words I was looking at on first glance, which is what you'd need for it to be casually readable.

>>Zenst+Jb
> That for me is not anonymized data.

What matters is that this is what most online companies (and their terms of service) would call anonymized data.

replies(1): >>q-base+qe

>>zoobab+(OP)
There's a startup in London called Synthesized working on part of this problem space.

Given a source dataset they create a synthetic dataset that has the same statistical properties (as defined at the point the synthetic dataset is created).

I've seen a demo, it's pretty slick https://synthesized.io/

>>raxxor+5c
The idea you don't need to trust society and the government is very attractive when ridiculous abuse has been done to you and you've had to manage the fall-out.

>>pjc50+j6
This hypothetical 21st century National Socialist German Workers' Party would likely be a little more subtle than blatantly enforcing existing immigration law. They'd obviously be interested in establishing a common enemy, even if it requires massive exaggeration mixed with complete fabrication. That would do them no good if domestically targeted propaganda were illegal though... hopefully the combination of laws and executive orders preventing that don't get relaxed by the two administrations proceeding this cringy edgelord scenario.

>>shusso+3c
You could, but if your statistic is a function of one person's data, differential privacy will force you to add enough noise to mask that one person's data, i.e. destroy almost all of the utility of the statistic.

It's possible to learn something by aggregating a bunch of those individually-privatized statistics. Randomized response [1] is a canonical example. More generally, local differential privacy is a stronger privacy model where users privatize their own data before releasing it for (arbitrary) analysis. As you might expect, the stronger privacy guarantee means worse utility, sometimes much worse [2].

[1] https://en.wikipedia.org/wiki/Randomized_response

>>zoobab+(OP)
There are also way too many people with access to non-anonymized data. i.e. the development team that has read privileges on the production database. e.g. employees at uber spying on customers (https://www.theguardian.com/technology/2016/dec/13/uber-empl...).

edit: shameless plug. check out tonic.ai for a solution to the above problem.

>>shakna+l7
Co-author here, we will add the source code very soon. A little swamped with the press coverage, but the source code in Julia+Python is coming.

>>zoobab+(OP)
Paywwall...

>>Freak_+87
Perhaps the risk that some of the data the probability is based on is wrong, e.g., "She fits the bill except they only have 4 children and the 5th is an entry error, so it's not her."

>>Freak_+87
Co-author here. We designed a statistical model, which is never 100% sure a re-identification is correct. There is, e.g., a non-null probability that two individuals in the US share 5, 10, or even 15 demographics attribute.

replies(1): >>mnky98+ms

>>Alexan+Ic
If they operate in Europe then I am pretty sure that the GDPR legislation is pretty straight forward here. If you can de-anonymize the data then it is by definition not anonymized.

replies(2): >>Zenst+Tf >>Shaani+Jg

>>rzwits+e6
I'll coin the phrase Sominization, where 'some' attempt to anonymize the data, and as such, should be used with great caution.

>>shusso+x9
There was a keynote about Differential Privacy and Census example at the recent PODS/SIGMOD by Cynthia Dwork.

I recommend watching it if you're interested at https://homepages.cwi.nl/~boncz/sigmod-pods2019.html (top-left vid)

(as a side-note Frank McSherry received SIGMOD Test Of Time Award for his Differential Privacy paper at the same conference).

>>q-base+qe
No idea why your comment was marked as dead as you raise a very valid point. vouched.

>>zoobab+(OP)
While things like this sound scary, deanonymization turns out to be not exceedingly impactful in practice. Most entities have no desire to deanonymize you, to them, the most useful format is to treat you as a GUID with a bag of attributes attached to it.

Of the entities that remain, they fall into two buckets: Ones powerful enough that they already have personally identifiable data without the need to deanonymize anonymous data sets and ones small enough that they don't have the capabilities to deanonymize.

If you're a government, you don't need to rely on anonymized data sets, you have the sets with the labels already. If you're a stalker or internet troll or whatever, it's far easier to just pay one of the PI websites $29 to get far more data on a person than any deanonymized dataset will give you.

replies(2): >>samfri+8j >>m463+Wl2

>>q-base+qe
That seems correct.

From an article on the subject:

>Recital 26 of the GDPR defines anonymized data as “data rendered anonymous in such a way that the data subject is not or no longer identifiable.” Although circular, this definition emphasizes that anonymized data must be stripped of any identifiable information, making it impossible to derive insights on a discreet individual, even by the party that is responsible for the anonymization.

replies(1): >>shusso+ii

>>rzwits+e6
A related/derivative concept that ties into the existing lexicon is name Vs pseudonym.

Bitcoin for example, uses pseudonyms... not anonymity. A pen name or a hn username is a pseudonym. A voting system needs to be anonymous not pseudo-anonymous^, by using a pseudonym. If each voter had a secret number that is attached to each vote, that is a pseudonym.

"L is 32 years old. She works as a nurse in Moscow." - L is a pseudonym. It isn't anonymous even though the name is ommitted.

^This can get grey, as even a piece of paper with an X on it will carry certain metadata or related data: which voting booth, etc. But, the goal is anonymity. IE, the X cannot be tied to anything else.

replies(1): >>doomju+W03

>>Shaani+Jg
it is not clear to me if this covers re-identifying.

replies(1): >>Zenst+Am

>>shalma+Yf
I think there are still threats that come from improperly anonymized data, beyond my own government or a single stalker/harasser.

For one, that this data is improperly anonymized would make it an easy avenue for malicious nation-state actors to use to track/analyze/destabilize the population. If I am a government with an interest in freaking out the US public, I could quite easily de-anonymize sensitive datasets and begin using them for wide-scale harassment, identity theft, etc. on an automated basis.

The lowering of the bar makes it easier for Johnny Troublemaker to start harassing people based on their PII as well. Instead of paying for the data, just download some datasets and run a Julia notebook against them. Maybe not much changes for the targeted stalking case, but now you can cast a wide net when looking for someone to mess with.

replies(1): >>shalma+Vk

>>gervu+B8
ESL here, what's wrong with pseudonymization as a derivative of pseudonym? I think it brings the point across. But if we want to stop beating around the bush, what about "nameless identification"?

replies(1): >>Whitne+4n

>>rzwits+e6
We covered this in a DB course with the term k-anonymity which seems to be standard in the literature, where a dataset is k-anonymous if every combination of characteristics (that can identify users) has at least K users. So in your case that dataset has only the 1-anonymity property, but you can set a k>1 and change the data set to satisfy it and improve the anonymity. Eg. if age was just stored as 90+ and there's at least 10 90+ year olds in each wide area then you'd get 10-anonymity.

I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

Also cool: this is how Have I been Pwned v2 works - if you send only the first 5 characters of a hash then it's guaranteed there's hundreds of matches and the server doesn't know the real password that had that hash prefix: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

replies(3): >>SkyBel+Ju >>tripzi+IJ >>Commun+7N2

>>samfri+8j
Malicious nation state actors can easily access PII without the need for deanonymization, simply by buying it on the open market.

The number of Johnny Troublemakers who are randomly spraying hate based on PII is about the same as the rate of people throwing rocks off highway overpasses onto cars below. It's simply not a significant enough problem to be worth worrying about.

>>rectan+l4
The part I can't wrap my head around is how to mitigate (future proof) unforeseen leakage and correlations. The example I keep going back to is deanonymizing movie reviews (chronologically correlating movie rentals and reviews). And, frankly, I'm just not clever enough to imagine most attacks.

If nothing else, I appreciate the Differential Privacy effort, if only to show the problem space is wicked hard.

I worked in medical records and protecting voter privacy. There's a lot of wishful thinking leading to unsafe practices. Having better models to describe what's what would be nice.

replies(1): >>bo1024+3x

>>rzwits+e6
As far as health data is concerned the regulation is very clear that any aggregates must have a certain minimum amount in bin for precisely this reason.

>>shusso+ii
I'd say that is pretty clear-cut with "making it impossible to derive insights on a discreet individual"

If it is possible, it's not anonymous per GDPR's definition and that is what counts.

replies(1): >>shusso+bs

>>hiccup+dk
>if we want to stop beating around the bush, what about "nameless identification"?

The difference is the other proposed alternatives more directly suggest risk is involved.

It's a nice ESL example because technically, I don't think you're suggestion is wrong. In practice I think few would infer its implications.

>>Berisl+aa
from an etymological perspective, this is kind of gross; you are grafting a latin-derived prefix onto a greek-derived stem. there's no direct Greek equivalent to the Latin "de", but you could consider "apo" or possibly "kata", both of which have similar meanings.

>>rzwits+e6

  > My go-to example to explain this is very simple: Let's 
  > say we reduce birthdate info to just your birthyear, 
  > and geoloc info to just a wide area. And then I have 
  > an pseudonimized individual who is marked down as 
  > being 105 years old.

  > Usually there's only one such person.

I was interested to find that HIPAA's requirements for de-identification address the two particular issues you pointed out. First, age above some threshold (90) must be bucketed together as "older than 90." Second, regarding ZIP codes: you must zero out the last two digits. And then, if the resulting identifier contains less than 20,000 inhabitants according to the most recent US census, you have to blank the first three digits as well (there are currently 17 such three-digit prefixes).

Source: Pages 96-97 of the combined legislation, available at: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-reg...

You are allowed to roll your own de-identification method, as long as the person doing so is an expert on statistics and de-identification and they document their analysis of why their method is sound. To my knowledge, most entities use the "safe harbor" approach of wiping any data in the legislated blacklist of dimensions.

replies(1): >>dwild+451

>>Zenst+Am
> "making it impossible to derive insights on a discreet individual"

doesn't clarify how much information you already have about the individual. There is a distinction between being able to identify someone without any prior knowledge about them vs re-identifying them. I don't think the GDPR is clear about that.

replies(1): >>Zenst+l91

>>Cynddl+ke
Can you provide a link to your paper?

replies(1): >>Cynddl+SK

>>Berisl+aa
> I would suggest "denonymization"

Very very close to the existing "deanonymization" which is essentially the opposite.

>>zoobab+(OP)
Article picture is an IBM iDataPlex system.

>>mcinty+Gk
>I guess then the interesting question is how high does k have to be to call it anonymous vs pseudonymous.

I think that for any size k less than the total size of the database, it is not anonymous. In cases like this, an overly strict definition favoring privacy is the only way to protect people. Similar to how we call 17 year olds children and treat them as such under law even though a 17 year old is far closer to an 18 year old than they are to a 5 year old (yes, there are some exceptions, but these are all explicitly called out). Another example of such an extreme is concerning falsifying data or making false statements. Even a single such statement, regardless of the number of true statements, destroys credibility once found when trust is extremely important. This is why even a single such statement can get one found in contempt of court or destroy a scientist's entire career (and even cast doubt on peers who were innocent).

Overall it is quite messy because it is a mix of a technical problem with a people problem.

replies(2): >>Shenda+7W >>dwohni+D41

>>pfortu+z6
An anonymized version would be one where you have a tally for each value/category in each field, without any correlating table between the fields. Only store the histograms.

replies(1): >>pfortu+5C

>>specia+wl
> The part I can't wrap my head around [...] The example I keep going back to is deanonymizing movie reviews

The reason is that you are thinking of an example that's not nicely compatible with differential privacy. The basic examples of DP would be something like a statistical query: approximately how many people gave Movie X three stars? You can ask a bunch of those queries, adding some noise, and be protected against re-identification.

You can still try to release a noisy version of the whole database using DP, but it will be very noisy. A basic algorithm (not good) would be something like

    For each entry (person, movie):
      with probability 0.02, keep the original rating
      otherwise, pick a rating at random

(A better one would probably compute a low-rank approximation, then add small noise to that.)

>>jobigo+4w
Well, yes, but then that is not what advertisers want... That is the thing.

Whenever a large enough database exists with individual data, we are doomed.

>>polski+t7
At ICML 2019 there was a good keynote by the chief scientist of the Census Bureau. The 2010 census was before a good understanding of differential privacy, so it wasn't really done correctly. The census bureau actually went back and bought a bunch of commercially available data, and deanonymized ~70% of the individuals in the US (if I remember correctly). So they wanted to do it better this time, and they seem to be taking the state of the art into account. It was much more impressive than I was expecting, actually.

https://icml.cc/Conferences/2019/ScheduleMultitrack?event=43... there is a video link on this page

>>rzwits+e6
>> Aggregated data that can no longer be reduced to a single individual at all.

Even aggregated data will loose the anonymisation characteristics when we are speaking of low volumes of data.

Number of cancer patients in the area A: 1 Number of residents in the area A: 1

replies(1): >>doomju+O03

>>mcinty+Gk
Isn't part of that problem how you define a "characteristic" (and thus, the combinations of them)? Because if it has k-anonymity for a certain set of characteristics, but you invent a new characteristic (based from combining info, datamining and perhaps other sources), the property doesn't necessarily hold any more.

>>mnky98+ms
The article is available here, in open access: https://www.nature.com/articles/s41467-019-10933-3

>>zoobab+(OP)
Effective de-identification is much harder than people think.

>>SkyBel+Ju
> I think that for any size k less than the total size of the database, it is not anonymous.

Wouldn't that require that every field of every record in the database be globally unique?

If something as simple as gender is a field in the database, the best k you could get would be the lowest count of records of each existent gender option.

replies(1): >>SkyBel+zY

>>Shenda+7W
It would mean that any identifying data being included would result in the data set not being considered anonymous. It will mean we have to tell people the data being sent to other organizations and companies is not anonymous, which is what should happen anyways. No more hiding behind 'we took the bare minimum steps required, if more advanced statistical information de-anonymized the data it isn't our fault'.

>>SkyBel+Ju
Is there any attribute of a person for which this is true?

I think what you're asking for is that any piece of data stored about someone be extensionally equivalent to "this is a human being" and no more which is not very useful (in an information-theoretic sense it has exactly zero use).

>>lmkg+To
Theses are pretty good ways to anonymize theses fields, but they are still just good if only that field is used.

Let say that you got 20,000 inhabitants, you'll only need about 14 fields that are binary, much less fields if they are not binary (which is quite likely to happen). You'll most likely already got the gender... Even if you limit the age to 10 possible values, that's equivalent to 3 binary fields!

>>shusso+bs
Interesting - would an example of what you outline be say a digital voice recording. Which, unless you know who is in the recording, you have no way to associate that digital data with an individual.

Would that example fall within the remit you outline and as such - skirt the whole GDPR aspect?

>>tialar+x4
To points 1 and 2: It's proven very difficult to sanitize datasets in a way that ensures anonymity, but doesn't render it useless. You aren't the first to think of these kinds of transformations.

There are problems with Point 3: we're continually surprised with how effectively smart people can identify people in datasets expected to be 'safe'. You've also not accounted for that a collection of non-identifying attributes may become identifying.

That said, the GDPR is largely about prohibiting unnecessary data collection, in the spirit of Point 3. Hopefully it'll help at least a little.

>>shalma+Yf
The amount of data and precision of web searches has grown enormously since the internet became active.

So, I would imagine everyone becomes more powerful over time.

>>mcinty+Gk
Part of the problem here too is that we may be dealing with more than one data set. If there is enough information overlap in those sets then you could map a record from one set to another. In that situation you'd have more data about each person, reducing the k-anonymity of the combined data sets to below the k-anonymity of either individual data set.

That concerns me most around places that process data for other companies (e.g., Cambridge Analytics, Facebook, Google, Amazon). These places could have access to many different data sets relating to a person, and could potentially combine these data sets to uniquely identify a single individual.

I recently looked at something that I gave a fake zip, birth date, and gender. Based on statistical probabilities it gave a 68% chance of a large data set having 1-anonymity. Wasn't clear what they were considering large, so could be bogus, but if true imagine what could easily be done with 10+ unique fields (e.g., zip, birthdate, gender, married?, # of children, ever smoked?, deductible amount, diabetes?, profession, BMI).

The earlier poster is right, only aggregate data is truly anonymous.

>>heaven+eF
Even for big numbers, if x = 100% or x = 0%

>>dalbas+ai
Rektor Seymour Skinner: "A certain agitator let's call her Lisa S. no that's too obvious let's say L. Simpson." [S07E05]

zlacker

Algorithm can pick out almost any American in supposedly anonymized databases