Singapore Privacy Commissioner releases a Data Anonymisation Tool

June 8, 2022

The Singapore Privacy Commissioner, the Personal Data Protection Commission (PDPC), has launched a free Data Anonymisation tool. Anonymisation is an important part of privacy protection, particularly in the preparation of data sets. It is also a contested issue.

The statement provides:

The PDPC has launched a free Data Anonymisation tool to help organisations transform simple datasets by applying basic anonymisation techniques. An infographic that provides guidance on how to use the tool is also included.


Basic Anonymisation

With the increasing collection and use of personal data, organisations may find themselves at more risk of encountering data breaches and anonymising data is one way to reduce that risk. Anonymised personal data can be used to generate insights for innovation while providing protection to individuals.

Organisations can perform basic anonymisation of their datasets through a simple 5-step process:

[Infographic: Anonymisation Process]

Guide to Basic Anonymisation

Learn techniques in anonymising data and how to appropriately perform de-identification and anonymisation of various datasets with the Guide to Basic Anonymisation.

It helps organisations to share data with other organisations or entities, where additional administrative and technical controls may be imposed to reduce the risk of unauthorised disclosure of personal data.

Data Anonymisation Tool

Use this free data anonymisation tool to transform simple datasets by applying anonymisation techniques. An infographic that provides guidance on how to use the tool is also included. 
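The statement does not describe how the tool works internally. To give a sense of what "basic anonymisation techniques" can look like in practice, here is a minimal Python sketch of three commonly used ones: character masking, generalisation and pseudonymisation. It is not the PDPC tool's implementation. The field names (name, phone, age, postal_code), the function names and the keyed-hash approach to pseudonyms are all assumptions made for illustration.

```python
import hashlib
import hmac


def mask_characters(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Character masking: hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def generalise_age(age: int, band: int = 10) -> str:
    """Generalisation: replace an exact age with a range such as '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Pseudonymisation: replace an identifier with a keyed hash (HMAC-SHA256).

    The key must be kept secret; anyone holding it can recompute the mapping.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymise_record(record: dict, secret_key: bytes) -> dict:
    """Apply the three techniques to one record with hypothetical field names."""
    return {
        # The name itself is dropped; only a pseudonym is kept.
        "pseudonym": pseudonymise(record["name"], secret_key),
        "phone": mask_characters(record["phone"]),
        "age_band": generalise_age(record["age"]),
        # Keep only the first two digits of the postal code.
        "postal_prefix": record["postal_code"][:2],
    }
```

Techniques like these reduce what each field reveals, but they do not by themselves show that the resulting dataset is safe to release; that depends on how the remaining attributes combine.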

System Requirements:

  • The Data Anonymisation tool was developed and tested using Excel 2010, 2016 and 2019 on Microsoft Windows Operating System (OS). The tool may not work properly on other versions of Excel on Windows.
  • Ensure that the regional format on your device is set to “English (Singapore)” by accessing the “Region” settings within Windows, or searching “Region” using the Windows search bar.

To check that the tool that you have downloaded is authentic and has not been tampered with, take the following steps:

  1. Open the Windows command prompt by entering “cmd” in the Windows search bar found at the bottom-left of your screen.
  2. Find the folder containing the downloaded tool. (E.g., type “cd C:\Users\<your_user_name>\Downloads” if the tool was downloaded to your Downloads folder)
  3. Run the certutil command by typing “certutil -hashfile <downloaded_file_name> sha256”.
  4. Check that the hash output is cee3f316868cc9674461bf03bb5c9b26f99a02cb739392c70def3412c93525c0 to verify the file’s authenticity.
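For readers who would rather script the check than run it by hand, the same comparison can be sketched in Python using the standard hashlib module. This is only an illustration of the steps above, not something the PDPC provides. The function names are made up for the example, and the path to the downloaded file is left for the reader to supply.

```python
import hashlib

# Expected SHA-256 value as published in the statement above.
EXPECTED_SHA256 = "cee3f316868cc9674461bf03bb5c9b26f99a02cb739392c70def3412c93525c0"


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_authentic(path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected value, ignoring case and spaces."""
    return sha256_of_file(path).lower() == expected.strip().lower()
```

Calling is_authentic with the location of the downloaded file returns True only when the digest matches the published value. A mismatch means the file differs from the one the hash was published for, and it should not be opened.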

Notwithstanding the release of the tool, how effective anonymisation can be remains a matter of controversy. In the 2019 Nature Communications article “Estimating the success of re-identifications in incomplete datasets using generative models”, the authors cast doubt on the ability of anonymised data sets to resist re-identification. The abstract provides:

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.

And by way of example they state:

Yet numerous supposedly anonymous datasets have recently been released and re-identified. In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences. A few months before, the Australian Department of Health publicly released de-identified medical records for 10% of the population only for researchers to re-identify them 6 weeks later. Before that, studies had shown that de-identified hospital discharge data could be re-identified using basic demographic attributes and that diagnostic codes, year of birth, gender, and ethnicity could uniquely identify patients in genomic studies data. Finally, researchers were able to uniquely identify individuals in anonymized taxi trajectories in NYC, bike sharing trips in London, subway data in Riga, and mobile phone and credit card datasets.

In its discussion the paper states:

In this paper, we proposed and validated a statistical model to quantify the likelihood for a re-identification attempt to be successful, even if the disclosed dataset is heavily incomplete. Beyond the claim that the incompleteness of the dataset provides plausible deniability, our method also challenges claims that a low population uniqueness is sufficient to protect people’s privacy. Indeed, an attacker can, using our model, correctly re-identify an individual with high likelihood even if the population uniqueness is low (Fig. 3a). While more advanced guarantees like k-anonymity would give every individual in the dataset some protection, they have been shown to be NP-Hard, hard to achieve in modern high-dimensional datasets, and not always sufficient.
While developed to estimate the likelihood of a specific re-identification to be successful, our model can also be used to estimate population uniqueness. We show in Supplementary Note 1 that, while not its primary goal, our model performs consistently better than existing methods to estimate population uniqueness on all five corpora (Supplementary Fig. 4, P < 0.05 in 78 cases out of 80 using Wilcoxon’s signed-rank test) and consistently better than previous attempts to estimate individual uniqueness. Existing approaches, indeed, exhibit unpredictably large over- and under-estimation errors. Finally, a recent work quantifies the correctness of individual re-identification in incomplete (10%) hospital data using complete population frequencies. Compared to this work, our approach does not require external data nor to assume this external data to be complete.
To study the stability and robustness of our estimations, we perform further experiments (Supplementary Notes 2–8).
First, we analyze the impact of marginal and association parameters on the model error and show how to use exogenous information to lower it. Table 1 and Supplementary Note 7 show that, at very small sampling fraction (below 0.1%), where the error is the largest, the error is mostly determined by the marginals, and converges after few hundred records when the exact marginals are known. The copula covariance parameters exhibit no significant bias and decrease fast when the sample size increases (Supplementary Note 8).
As our method separates marginals and association structure inference, exogenous information from larger data sources could also be used to estimate marginals with higher accuracy. For instance, count distributions for attributes such as date of birth or ZIP code could be directly estimated from national surveys. We replicate our analysis on the USA corpus using a subsampled dataset to infer the association structure along with the exact counts for marginal distributions. Incorporating exogenous information reduces, e.g., the mean MAE of uniqueness across all corpora by 48.6% (P < 0.01, Mann–Whitney) for a 0.1% sample. Exogenous information become less useful as the sampling fraction increases (Supplementary Table 2).
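
The two notions the paper contrasts, uniqueness and k-anonymity, are easy to measure on a dataset one actually holds. The Python sketch below does only that: it counts how many records share each combination of quasi-identifiers. It is not the authors' copula-based model, which estimates uniqueness in the wider population from an incomplete sample. The function names and the attribute names used in the usage note are assumptions for illustration.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def sample_uniqueness(records, quasi_identifiers) -> float:
    """Fraction of records whose quasi-identifier combination appears only once."""
    if not records:
        return 0.0
    classes = equivalence_classes(records, quasi_identifiers)
    unique = sum(1 for count in classes.values() if count == 1)
    return unique / len(records)


def k_anonymity(records, quasi_identifiers) -> int:
    """Size of the smallest group: the dataset is k-anonymous for this k."""
    if not records:
        return 0
    return min(equivalence_classes(records, quasi_identifiers).values())
```

Calling sample_uniqueness on the same records with a longer list of attributes (say, adding a hypothetical sex and postal_prefix to age_band) can only keep the result the same or raise it, because every extra attribute splits groups further. That is the intuition behind the paper's finding that a modest number of demographic attributes is enough to single most people out.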
