National Institute of Standards and Technology releases draft guide on de-identifying government data sets. A very useful guide for all those who practise privacy and cyber security.

November 18, 2022 |

De-identifying data is a critical part of managing data, avoiding reputational damage in the event of a data breach and complying with privacy legislation.  It is fundamental yet poorly understood, let alone properly implemented.  The National Institute of Standards and Technology has released the third draft of its De-Identifying Government Data Sets.  As with many NIST reports it is lengthy, not to mention highly technical, but it is worth reading.  NIST produces the best technical guides in the privacy and cyber security sphere.

This is an excellent guide because it sets out clearly what de-identification involves, why it is important, what the risks are and how organisations and agencies should approach it. The United Kingdom’s Information Commissioner has prepared excellent guidance on anonymisation, pseudonymisation and privacy-enhancing technologies.  Given the nature of recent data breaches in Australia, de-identifying older records is important.  The guidance available in Australia is inadequate.

The abstract provides:

De-identification is a process that is applied to a dataset with the goal of preventing or limiting informational risks to individuals, protected groups, and establishments while still allowing for meaningful statistical analysis. Government agencies can use de-identification to reduce the privacy risk associated with collecting, processing, archiving, distributing, or publishing government data. Previously, NISTIR 8053, De-Identification of Personal Information, provided a survey of de-identification and re-identification techniques. This document provides specific guidance to government agencies that wish to use de-identification. Before using de-identification, agencies should evaluate their goals for using de-identification and the potential risks that de-identification might create. Agencies should decide upon a de-identification release model, such as publishing de-identified data, publishing synthetic data based on identified data, or providing a query interface that incorporates de-identification. Agencies can create a Disclosure Review Board to oversee the process of de-identification. They can also adopt a de-identification standard with measurable performance levels and perform re-identification studies to gauge the risk associated with de-identification. Several specific techniques for de-identification are available, including de-identification by removing identifiers and transforming quasi-identifiers and the use of formal privacy models. People performing de-identification generally use special-purpose software tools to perform the data manipulation and calculate the likely risk of re-identification. However, not all tools that merely mask personal information provide sufficient functionality for performing de-identification. This document also includes an extensive list of references, a glossary, and a list of specific de-identification tools, which is only included to convey the range of tools currently available and is not intended to imply a recommendation or endorsement by NIST.

The Executive Summary provides:

Many Government documents use the phrase personally identifiable information (PII) to describe private information that can be linked to an individual [62, 79], although there are a variety of definitions for PII. As a result, it is possible to have information that singles out individuals but that does not meet a specific definition of PII. This document therefore presents ways of removing or altering information that can identify individuals that go beyond merely removing PII. For decades, de-identification based on simply removing identifying information was thought to be sufficient to prevent the re-identification of individuals in large datasets. Since the mid-1990s, a growing body of research has demonstrated the reverse, resulting in new privacy attacks capable of re-identifying individuals in “de-identified” data releases. For several years the goals of such attacks appeared to be the embarrassment of the publishing agency and achieving academic distinction for the privacy researcher. More recently, as high-resolution de-identified geolocation data has become commercially available, re-identification techniques have been used by journalists and activists with the goal of learning confidential information.
These attacks have become more sophisticated in recent years with the availability of geolocation data, highlighting the deficiencies in traditional de-identification approaches. Formal models of privacy, like k-anonymity and differential privacy, use mathematically rigorous approaches that are designed to allow for the controlled use of confidential data while minimizing the privacy loss suffered by the data subjects. Because there is an inherent trade-off between the accuracy of published data and the amount of privacy protection afforded to data subjects, most formal methods have some kind of parameter that can be adjusted to control the “privacy cost” of a particular data release. Informally, a data release with a low privacy cost causes little additional privacy risk to the participants, while a higher privacy cost results in more privacy risk. When they are available, formal privacy methods should be preferred over informal, ad hoc methods.
Decisions and practices regarding the de-identification and release of government data can be integral to the mission and proper functioning of a government agency. As such, an agency’s leadership should manage these activities in a way that assures performance and results in a manner that is consistent with the agency’s mission and legal authority. One way that agencies can manage this risk is by creating a formal Disclosure Review Board (DRB) that consists of legal and technical privacy experts, stakeholders within the organization, and representatives of the organization’s leadership. The DRB evaluates applications for data release that describe the confidential data, the techniques that will be used to minimize the risk of disclosure, the resulting protected data, and how the effectiveness of those techniques will be evaluated.
Establishing a DRB may seem like an expensive and complicated administrative undertaking for some agencies. However, a properly constituted DRB and the development of consistent procedures regarding data release should enable agencies to lower the risks associated with each data release, which is likely to save agency resources in the long term.
Agencies can create or adopt standards to guide those performing de-identification and regarding the accuracy of de-identified data. If accuracy goals exist, then techniques such as differential privacy can be used to make the data sufficiently accurate for the intended purpose but not unnecessarily more accurate, which can limit the amount of privacy loss. However, agencies must carefully choose and implement accuracy requirements. If data accuracy and privacy goals cannot be well maintained, then releases of data that are not sufficiently accurate can result in incorrect scientific conclusions and policy decisions.
Agencies should consider performing de-identification with trained individuals using software specifically designed for the purpose. While it is possible to perform de-identification with off-the-shelf software like a commercial spreadsheet or financial planning program, such programs typically lack the key functions required for proper de-identification. As a result, they may encourage the use of simplistic de-identification methods, such as deleting sensitive columns and manually searching for and removing data that appears sensitive. This may result in a dataset that appears de-identified but that still contains significant disclosure risks. Finally, different countries have different standards and policies regarding the definition and use of de-identified data. Information that is regarded as de-identified in one jurisdiction may be regarded as being identifiable in another.
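
That warning about simplistic methods is easy to illustrate. The following minimal sketch (mine, not the report’s; the column names and values are hypothetical) shows why merely deleting a direct identifier such as a name can leave every row unique on its quasi-identifiers and therefore linkable to an outside dataset:

```python
# Minimal illustration (not from the NIST report): dropping the "name" column
# still leaves quasi-identifier combinations that are unique in the dataset,
# so anyone with an auxiliary source (e.g. a voter roll) could re-identify rows.
from collections import Counter

records = [
    {"name": "Alice", "zip": "20500", "birth_year": 1974, "sex": "F", "diagnosis": "asthma"},
    {"name": "Bob",   "zip": "20500", "birth_year": 1981, "sex": "M", "diagnosis": "diabetes"},
    {"name": "Carol", "zip": "20740", "birth_year": 1974, "sex": "F", "diagnosis": "flu"},
]

# "De-identification" by simply deleting the direct identifier.
released = [{k: v for k, v in r.items() if k != "name"} for r in records]

def quasi(row):
    """Quasi-identifier combination left behind once the name is removed."""
    return (row["zip"], row["birth_year"], row["sex"])

# Count how many released rows share each quasi-identifier combination.
counts = Counter(quasi(r) for r in released)

for r in released:
    n = counts[quasi(r)]
    status = "unique - re-identifiable by linkage" if n == 1 else "shared"
    print(quasi(r), f"appears {n} time(s):", status)
```

Every row in this toy release is unique on zip code, birth year and sex, which is exactly the kind of disclosure risk a spreadsheet-style “delete the sensitive columns” approach misses.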

The conclusion provides:

Government agencies can use de-identification technology to make datasets available to researchers and the public without compromising the privacy of the people contained within the data. There are currently three primary models available for de-identification:

1. agencies can make data available with traditional de-identification techniques that rely on the suppression of identifying information (direct identifiers) and the manipulation of information that partially identifies (quasi-identifiers);

2. agencies can create synthetic datasets; and

3. agencies can make data available through a query interface.

These models can be mixed within a single dataset to provide different kinds of access for different users or intended uses. Privacy protection can be strengthened when agencies employ formal models for privacy protection, such as differential privacy, because the mathematical models that these systems use are designed to ensure privacy protection irrespective of future data releases or developments in re-identification technology. However, the mathematics underlying these systems is very new, and there is little experience within the Government in using these systems. Thus, agencies should understand the implications of these systems before deploying them in place of traditional de-identification approaches that do not offer formal privacy guarantees.
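
To make the notion of a “privacy cost” parameter concrete, here is a minimal sketch of the Laplace mechanism for a counting query, the textbook construction behind epsilon-differential privacy. The dataset and epsilon values are hypothetical, and the code is illustrative only, not a statement of how any agency implements these systems:

```python
# Minimal sketch of a differentially private counting query using the Laplace
# mechanism. Epsilon is the "privacy cost" parameter: a smaller epsilon adds
# more noise, giving less accuracy but a smaller privacy loss per release.
import random

def dp_count(values, predicate, epsilon):
    """Return a noisy count of values satisfying predicate.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical dataset: ages of survey respondents.
ages = [23, 35, 44, 51, 29, 61, 38, 47]

def over_40(age):
    return age >= 40

for eps in (0.1, 1.0, 10.0):
    noisy = dp_count(ages, over_40, eps)
    print(f"epsilon={eps:5.1f}  noisy count of age >= 40: {noisy:.2f}")
```

Running it shows the trade-off the report describes: at epsilon 0.1 the answer is heavily perturbed, while at epsilon 10 it sits close to the true count of 4 but offers far weaker protection.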

Some interesting matters contained in the report are:

  • De-identification is a process that is applied to a dataset with the goal of preventing or limiting privacy risks to individuals, protected groups, and establishments while still allowing for the production of aggregate statistics. De-identification is not a single technique, but a collection of approaches, algorithms, and tools that can be applied to different kinds of data with differing levels of effectiveness. In general, the potential risk to privacy posed by a dataset’s release decreases as more aggressive de-identification techniques are employed, but data accuracy and – in some cases – the ultimate utility of the de-identified dataset decreases as well.
  • Accuracy is traditionally defined as the “closeness of computations or estimates to the exact or true values that the statistics were intended to measure”. The data accuracy of de-identified data, therefore, refers to the degree to which inferences drawn on the de-identified data will be consistent with inferences drawn on the original data. Data accuracy can be measured by the ratio of a value computed with de-identified data to the same value computed using the underlying true confidential value (see the worked example after this list).
  • data accuracy refers to the abstract characteristic of the data as determined by a specific, measurable statistic. In general, data accuracy decreases as more aggressive de-identification techniques are employed.
  • data utility refers to the benefit derived from the application of the data to a specific use.
  • data may have low accuracy because they contain errors or substantial noise, yet users may nevertheless derive high value from the data, giving the data high utility. Likewise, data that are very close to the reality of the thing being measured may have high accuracy but may be fundamentally worthless and, thus, have low utility.
  • Re-identification is the general term for any process that restores the association between a set of de-identified data and the data subject. Re-identification is not the only way that de-identification techniques can fail to protect privacy. Improperly de-identified information can also be used to infer private facts about individuals that were thought to have been protected.
  • Re-identification risk is the likelihood that a third party can re-identify data subjects in a de-identified dataset. Re-identification risk is typically a function of the adverse impacts that would arise if the re-identification were to occur and the likelihood of occurrence.
  • Redaction is the removal of information from a document or dataset for legal or security purposes. Redaction alone is not sufficient to provide formal privacy guarantees, such as differential privacy. Redaction may also reduce the data accuracy of the dataset since the use of selective redaction may result in the introduction of non-ignorable bias.
  • Anonymization is a “process that removes the association between the identifying dataset and the data subject”. This term is reserved for de-identification processes that cannot be reversed.
  • Pseudonymization is a “particular type of [de-identification] that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms”. The term coded is frequently used in healthcare settings to describe data that has been pseudonymized.
  • pseudonymization is commonly used so that multiple observations of an individual over time can be matched and so that an individual can be re-identified if there is a policy reason to do so.
  • although pseudonymous data are typically re-identified by consulting a key that may be highly protected, the existence of the pseudonym identifiers frequently increases the risk of re-identification through other means (a minimal sketch of keyed pseudonyms appears after this list).
  • any effort that involves the release of data that contain personal information typically involves making a trade-off between identifiability and data accuracy. Increased privacy protections do not necessarily result in decreased data utility.
  • Disclosure is generally the exposure of data beyond the original collection use.
  • Disclosure limitation is a general term for the practice of allowing summary information or queries on data within a dataset to be released without revealing information about specific individuals whose personal information is contained within the dataset.
  • Differential privacy is a model based on a mathematical definition of privacy that considers the risk to an individual from the release of a query on a dataset containing their personal information.
  • some users of de-identified data may be able to use the data to make inferences about private facts regarding the data subjects. They may even be able to re-identify the data subjects. Both of these uses undo the privacy goals of de-identification.
  • regarding the risk of re-identification, agencies should aim to make an informed decision about the fidelity of the data that they release by systematically evaluating the risks and benefits and choosing de-identification techniques and data-sharing models that are tailored to their requirements.
  • when telling individuals that their de-identified information will be released, agencies should disclose that privacy risks may remain despite de-identification.
  • it is essential to plan for successful de-identification and data release, including:
    • the research design,
    • data collection and protection of identifiers,
    • disclosure analysis and the data-sharing strategy,
    • a comprehensive analysis of the purpose of the data release and the expected use of the released data,
    • the privacy-related risks, and
    • the privacy-protecting controls.
  • De-identification can have significant costs, including time, labor, and data processing costs. When properly executed, this effort can result in data that have high value for a research community and the general public while still adequately protecting individual privacy.
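
As flagged in the bullet on data accuracy, the ratio measure lends itself to a small worked example. The figures below are hypothetical; the only point is that a ratio close to 1.0 means an inference drawn from the de-identified data closely tracks the same inference drawn from the confidential data:

```python
# Worked illustration (hypothetical figures) of the accuracy measure described
# above: the ratio of a statistic computed on the de-identified data to the
# same statistic computed on the underlying confidential data.
confidential_incomes = [42_000, 58_000, 61_000, 75_000, 120_000]

# Suppose de-identification rounded each income to the nearest 10,000.
deidentified_incomes = [round(x, -4) for x in confidential_incomes]

true_mean = sum(confidential_incomes) / len(confidential_incomes)
deid_mean = sum(deidentified_incomes) / len(deidentified_incomes)

accuracy_ratio = deid_mean / true_mean
print(f"confidential mean:  {true_mean:,.0f}")
print(f"de-identified mean: {deid_mean:,.0f}")
print(f"accuracy ratio:     {accuracy_ratio:.4f}")  # close to 1.0 means high accuracy
```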
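
The pseudonymisation bullets also lend themselves to a short sketch. One common approach (assumed here for illustration, not prescribed by the report) is to replace a direct identifier with a keyed hash held by the data custodian, which preserves linkage across observations while confining re-identification to whoever holds the key and the corresponding lookup table:

```python
# Minimal pseudonymisation sketch: a keyed hash (HMAC) turns a direct
# identifier into a stable pseudonym. Repeated observations of the same person
# still link together, but re-identification requires the custodian's key and
# a lookup table (or recomputation over candidate identifiers).
import hashlib
import hmac

SECRET_KEY = b"held-by-the-data-custodian-only"  # hypothetical key

def pseudonym(identifier: str) -> str:
    """Deterministic pseudonym for an identifier; meaningless without the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

observations = [
    ("alice@example.org", "2021-03-01", "visit"),
    ("bob@example.org",   "2021-03-02", "visit"),
    ("alice@example.org", "2021-04-10", "follow-up"),
]

# The released rows carry pseudonyms instead of email addresses; the two
# "alice" rows still share a pseudonym, which is the linkage pseudonymisation preserves.
for identifier, date, event in observations:
    print(pseudonym(identifier), date, event)
```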
