18 May 2018

Reidentification

Protecting unit-record level personal information: The limitations of de-identification and the implications for the Privacy and Data Protection Act by Vanessa Teague, Chris Culnane and Benjamin Rubinstein for the Office of the Victorian Information Commissioner (OVIC) offers cautions about de-identification in Victoria's public and private sectors.

The report states
De-identification is a subject that has received much attention in recent years from privacy regulators around the globe. Once touted as a silver bullet for protecting the privacy of personal information, the reality is that when it involves the release of data to the public, the process of de-identification is much more complex. 
As improvements in technology increase the type and rate at which data is generated, the possibility of re-identification of publicly released data is greater than ever. Auxiliary information – or secondary information – can be used to connect an individual to seemingly de-identified data, enabling an individual’s identity to be ascertained. Auxiliary information can come from anywhere, including other publicly available sources online. 
In recent examples of successful re-identification that we have seen in Australia, it is clear that those releasing de-identified data did not appreciate the auxiliary information that would be available for re-identification – in that they did not expect re-identification would be possible. Individual data elements may be non-distinct and recognisable in many people, but a combination of them will often be unique, making them attributable to a specific individual. This is why de-identification poses a problem for unit-record level data.
OVIC comments
This report is one of a number of publications on de-identification produced by, or for, the Victorian public sector. Notably, in early 2018 Victoria’s Chief Data Officer issued a de-identification guideline to point to what ‘reasonable steps’ for de-identification looks like in the context of data analytics and information sharing under the Victorian Data Sharing Act 2017 (VDS Act). This paper is not aimed at the work conducted by the Victorian Centre for Data Insights (VCDI), where information sharing occurs within government with appropriate controls, and it is not intended to inhibit that work. Rather, it speaks to the use of de-identification more broadly, in circumstances where ‘de-identified’ data is made freely available through public or otherwise uncontrolled release of data sets, as occurs in so-called “open data” programs. This report should be interpreted in that context. ...
This report has been produced to demonstrate the complexities of de-identification and serve as a reminder that even if direct identifiers have been removed from a data set, it may still constitute ‘personal information’. The intention is not to dissuade the use of de-identification techniques to enhance privacy, but to ensure that those relying on and sharing de-identified information to drive policy design and service delivery understand the challenges involved where the husbandry of that data is not managed. ... Public release of de-identified information may not always be a safe option, depending on the techniques used to treat the data and the auxiliary information that the public may have access to. Wherever unit level data – containing data related to individuals – is used for analysis, OVIC’s view is that this is most appropriately performed in a controlled environment by data scientists. Releasing the data publicly in the hope that ‘de-identification’ provides protection from a privacy breach is, as this paper demonstrates, a risky enterprise.
The authors go on to state
A detailed record about an individual that has been de-identified, but is released publicly, is likely to be re-identifiable, and there is unlikely to be any feasible treatment that retains most of the value of the record for research, and also securely de-identifies it. A person might take reasonable steps to attempt to de-identify such data and be unaware that individuals can still be reasonably identified.
The word ‘de-identify’ is, unfortunately, highly ambiguous. It might mean removing obvious identifiers (which is easy) or it might mean achieving the state in which individuals cannot be ‘reasonably identified’ by an adversary (which is hard). It is very important not to confuse these two definitions. Confusion causes an apparent controversy over whether de-identification “works”, but much of this controversy can be resolved by thinking carefully about what it means to be secure. When many different data points about a particular individual are connected, we recommend focusing instead on restricting access and hence the opportunity for misuse of that data. Secure research environments and traditional access control mechanisms are appropriate.
Aggregated statistics, such as overall totals of certain items (even within certain groups of individuals) could possibly be safely released publicly. Differential privacy offers a rigorous and strong definition of privacy protection, but the strength of the privacy parameters must be traded off against the precision and quantity of the published data.
This paper discusses de-identification of a data set in the context of release to the public, for example via the internet, where it may be combined with other data. That context includes the concept of “open data”, in which governments make data available for any researchers to analyse in the hope they can identify issues or patterns of public benefit.
Therefore, it’s important to emphasise that this document should not be read as a general warning against data sharing within government, or in a controlled research environment where the combination of the data set with other data can be managed. It is not intended to have a chilling effect on sharing of data in those controlled environments.
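The differential privacy trade-off the authors describe can be made concrete. The following Python sketch is ours, not the report's; the function name and parameter values are illustrative. It releases a single aggregate count under epsilon-differential privacy using the standard Laplace mechanism:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A counting query has sensitivity 1: adding or removing any one
    individual changes the result by at most 1, so Laplace noise with
    scale 1/epsilon suffices. Smaller epsilon means stronger privacy
    but a noisier, less precise published statistic.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# The trade-off described above: at epsilon = 0.1 the noise standard
# deviation is about 14; at epsilon = 1.0 it is about 1.4.
print(dp_count(1200, epsilon=0.1))
print(dp_count(1200, epsilon=1.0))
```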
In reference to statutory responsibilities, the report comments
In taking ‘reasonable steps’, a data custodian must have regard to not only the mathematical methods of de-identifying the information, but also “the technical and administrative safeguards and protections implemented in the data analytics environment to protect the privacy of individuals”.
Therefore, there is a possibility that in some circumstances, a dataset in which ‘reasonable steps’ have been taken for de-identification under the VDS Act may not be de-identified according to the PDP Act, because individuals may still be ‘reasonably identified’ if the records are released publicly outside the kinds of research environments described in the VDS Act.
In this report, we describe the main techniques that are used for de-identifying personal information. There are two main ways of protecting the privacy of data intended for sharing or release: removing information, and restricting access. We explain when de-identification does (or does not) work, using datasets from health and transport as examples. We also explain why these techniques might fail when the de-identified data is linked with other data, so as to produce information in which an individual is identifiable.
3. Does de-identification work?
In one sense, the answer is obviously yes: de-identification can protect privacy by deleting all the useful information in a data set. Conversely, it could produce a valuable data set by removing names but leaving in other personal information. The question is whether there is any middle ground; are there techniques for de-identification that “work” because they protect the privacy of unit-record level data while preserving most of its scientific or business value?
Controversy also arises from disagreement over the definitions of ‘de-identification’ and ‘work’. De-identification might mean:
• following a process such as removing names, widening the ranges of ages or dates, and removing unusual records; or 
• achieving the state in which individuals cannot be ‘reasonably identified’.
These two meanings should not be confused, though they often are. A well-intentioned official might carefully follow a de-identification process, but some individuals might still be ‘reasonably identifiable’. Compliance with de-identification protocols and guidelines does not necessarily imply proper mathematical protections of privacy. This misunderstanding has potential implications for privacy law, where information that is assumed to be de-identified is treated as non-identifiable information and subsequently shared or released publicly.
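The gap between the two meanings is easy to demonstrate. In the following Python sketch (ours; the data and column names are invented), a dataset is treated by faithfully following a process, yet some records remain unique on the surviving quasi-identifiers and so may still be ‘reasonably identifiable’:

```python
import pandas as pd

# A hypothetical unit-record extract (data and columns invented).
records = pd.DataFrame({
    "name":       ["A. Smith", "B. Jones", "C. Wu", "D. Patel"],
    "birth_year": [1961, 1963, 1987, 1990],
    "postcode":   ["3000", "3000", "3121", "3121"],
    "procedure":  ["hip replacement", "hip replacement",
                   "appendectomy", "heart transplant"],
})

# Meaning 1: follow a process -- remove direct identifiers and widen
# ranges, as a well-intentioned official might.
treated = (records
           .drop(columns=["name"])
           .assign(birth_decade=lambda df: (df["birth_year"] // 10) * 10)
           .drop(columns=["birth_year"]))

# Meaning 2: check the resulting state -- how many records are still
# unique on the remaining quasi-identifiers?
quasi = ["birth_decade", "postcode", "procedure"]
group_sizes = treated.groupby(quasi).size()
n_unique = int((group_sizes == 1).sum())
print(f"{n_unique} of {len(treated)} treated records are still unique")
```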
De-identification would work if an adversary who was trying to re-identify records could not do so successfully. Success depends on ‘auxiliary information’ – extra information about the person that can be used to identify their record in the dataset. Auxiliary information could include age, place of work, medical history etc. If an adversary trying to re-identify individuals does not know much about them, re-identification is unlikely to succeed. However, if they have a vast dataset (with names) that closely mirrors enough information in the de-identified records, re-identification of unique records will be possible.
4. Can the risk of re-identification be assessed?
For a particular collection of auxiliary information, we can ask a well-defined mathematical question: can someone be identified uniquely based on just that auxiliary information?
There are no probabilities or risks here – we are simply asking what can be inferred from a particular combination of data sets and auxiliary information. This is generally not controversial. The controversy arises from asking what auxiliary information somebody is likely to have.
For example, in the Australian Department of Health's public release of MBS/PBS billing data, those who prepared the dataset carefully removed all demographic data except the patient’s gender and year of birth, therefore ensuring that demographic information was not enough on its own to identify individuals. However, we were able to demonstrate that with an individual's year of birth and some information about the date of a surgery or other medical event, the individual could be re-identified. There was clearly a mismatch between the release authority's assumptions and the reality about what auxiliary information could be available for re-identification.
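That well-defined question can be asked of any assumed collection of auxiliary information. A short Python sketch (ours; the data and column names are invented, loosely modelled on the scenario just described):

```python
import pandas as pd

def fraction_unique(df: pd.DataFrame, auxiliary: list) -> float:
    """Fraction of records pinned down uniquely by the given columns.

    This answers the mathematical question for one assumed collection
    of auxiliary information; it says nothing about how likely an
    adversary is to actually hold that information.
    """
    group_sizes = df.groupby(auxiliary).size()
    return (group_sizes == 1).sum() / len(df)

# Invented stand-in for a billing extract.
claims = pd.DataFrame({
    "gender":       ["F", "F", "M", "M", "F", "M"],
    "birth_year":   [1955, 1955, 1972, 1972, 1955, 1972],
    "surgery_date": ["2014-03-02", "2014-07-19", "2015-01-08",
                     "2015-11-30", "2014-03-02", "2016-02-14"],
})

# Demographics alone leave every record here indistinct from others...
print(fraction_unique(claims, ["gender", "birth_year"]))                  # 0.0
# ...but adding a roughly known event date makes most records unique.
print(fraction_unique(claims, ["gender", "birth_year", "surgery_date"]))  # ~0.67
```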
5. How re-identification works
Re-identification works by identifying a ‘digital fingerprint’ in the data, meaning a combination of features that uniquely identify a person. If two datasets have related records, one person's digital fingerprint should be the same in both. This allows linking of a person's data from the two datasets – if one dataset has names then the other dataset can be re-identified.
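A linkage of this kind takes only a few lines. In the sketch below (ours; all names and data are invented), the shared fields serve as the fingerprint and an inner join re-attaches names to the ‘de-identified’ records:

```python
import pandas as pd

# A 'de-identified' release: direct identifiers removed, sensitive
# attribute retained (all data invented).
released = pd.DataFrame({
    "birth_year": [1948, 1982, 1982],
    "postcode":   ["3056", "3056", "3183"],
    "diagnosis":  ["diabetes", "asthma", "depression"],
})

# An auxiliary dataset with names, e.g. a customer database or scraped
# public profiles, sharing some fields with the release.
auxiliary = pd.DataFrame({
    "name":       ["E. Nguyen", "F. Brown"],
    "birth_year": [1948, 1982],
    "postcode":   ["3056", "3183"],
})

# The shared fields act as the 'digital fingerprint': an inner join
# re-attaches a name wherever a record matches uniquely on those fields.
fingerprint = ["birth_year", "postcode"]
reidentified = auxiliary.merge(released, on=fingerprint, how="inner")
print(reidentified)  # E. Nguyen -> diabetes, F. Brown -> depression
```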
Computer scientists have used linkage to re-identify de-identified data from various sources including telephone metadata, social network connections, health data and online ratings, and found high rates of uniqueness in mobility data and credit card transactions.  Simply linking with online information can work.
Most published re-identifications are performed by journalists or academics. Is this because they are the only people who are doing re-identification, or because they are the kind of people who tend to publish what they learn? Although by definition we won’t hear about the unpublished re-identifications, there are certainly many organisations with vast stores of auxiliary information. The database of a bank, health insurer or employer could contain significant auxiliary information that could be of great value in re-identifying a health data set, for example, and those organisations would have significant financial incentive to do so. The auxiliary information available to law-abiding researchers today is the absolute minimum that might be available to a determined attacker, now or in the future.
This potential for linkage of one data set with other data sets is why the federal Australian Government's draft bill to criminalise re-identification is likely to be ineffective, and even counterproductive. If re-identification is not possible then it doesn't need to be prohibited; if re-identification is straightforward then governments (and the people whose data was published) need to find out.
The rest of this report examines what de-identification is, whether it works, and what alternative approaches may better protect personal information. After assessing whether de-identification is a myth, we outline constructive directions for where to go from here. Our technical suggestions focus on differential privacy and aggregation. We also discuss access control via secure research environments.