The National Institute of Standards and Technology releases Cybersecurity of Genomic Data

March 6, 2023

The National Institute of Standards and Technology (“NIST”) has released its initial draft of Cybersecurity of Genomic Data.

The media release provides:

Genomic data has enabled the rapid growth of the U.S. bioeconomy and is valuable to the individual, industry, and government due to intrinsic properties that, in combination, make it different from other types of high-value data which possess only a subset of these properties. The characteristics of genomic data compared to other high-value datasets raise some correspondingly unique cybersecurity and privacy challenges that are inadequately addressed with current policies, guidance, and technical controls.

This report describes current practices in risk management, cybersecurity, and privacy management for protecting genomic data, as well as the associated challenges and concerns. It identifies gaps in protection practices across the genomic data lifecycle and proposes solutions to address real-life use cases occurring at various stages of the genomic data lifecycle. This report also is intended to provide areas for regulatory/policy enactment or further research.

Genomic data has multiple intrinsic properties that in combination make it different from other types of high-value data which possess only a subset of these properties. The characteristics of genomic data compared to other high-value datasets raise unique cybersecurity and privacy challenges.

The NIST report proposes a set of solution ideas that address real-life use cases occurring at various stages of the genomic data lifecycle, along with candidate mitigation strategies and the expected benefits of the solutions. Additionally, areas needing regulatory/policy enactment or further research are highlighted.

Cyber attacks targeting genomic data include attacks against the confidentiality, integrity, and availability of the data:

  • attacks on the confidentiality of the data can threaten the economy through theft of the intellectual property owned by the biotechnology industry,
  • attacks on the integrity of the data can disrupt:
    • biopharmaceutical output,
    • agricultural food production, and
    • bio-manufacturing activity.
  • attacks on the availability of the data include:
    • encrypting data for ransom,
    • deletion of data, and
    • disabling critical automated equipment used in research, development, and manufacturing.
  • the potential harms of cyber attacks on genomic data also threaten national security, including by enabling the development of biological weapons and the surveillance, oppression, and extortion of our citizens, military, and intelligence personnel based on their genomic data.
  • attacks on genomic data can also harm individuals by enabling blackmail, discrimination based on disease risk, and privacy loss from the revelation of hidden consanguinity or phenotypes including health, emotional stability, mental capacity, appearance, and physical abilities.

There is insufficient predictability, manageability, and disassociability in genomic data processing. Insufficient predictability can result in privacy problems if individuals are surprised by what is happening with their genomic data. Insufficient manageability in data processing can arise when the capabilities are not in place to allow for appropriately granular administration of genomic data. Permitting access to raw genomic data, instead of using appropriate privacy-enhancing technologies to extract only the necessary insights (without revealing the raw data), introduces privacy risks from insufficient disassociability in data processing. Each of these areas of privacy risk can disrupt the ability to realize the benefits of processing genomic data.

Genomes:

  • consist of hereditary material composed of nucleic acids, mostly in the form of deoxyribonucleic acid (DNA).
  • contain the full set of instructions to form an organism and are largely unchanged from conception to death.

Genomic data is immutable, associative, and conveys important health, phenotype, and personal information about individuals and their kin (past and future). In some cases, small fragments of genomic data stripped of identifiers can be used to re-identify persons, though the vast majority of the genome is shared among individuals.

Loss of control of genomic data can cause risks to privacy, personal security, and national security, as adversaries can use genomic data for nefarious reasons such as surveillance, oppression, and extortion. Genomic database breaches or other losses of data may result in thefts of intellectual property and put the U.S. at a competitive disadvantage in biotechnology. As reported by national security experts, security threats may arise through the creation of population-specific bioweapons or compromised identities of national security agents. Cyber attacks have occurred on genomic databases, DNA sequencing instruments, and genomic software tools.

Privacy challenges resulting from the use of human genomic data include problems for individuals such as enabling blackmail, discrimination based on disease risk, and the revelation of hidden consanguinity or phenotypes including health, emotional stability, mental capacity, appearance, and physical abilities.

Potential privacy problems include:

  • re-identification of de-identified genomic data: Human genomic data, even small fragments of a person’s whole genome, can usually be re-identified for some populations when combined with available datasets, such as ancestry data, self-shared identified genomic data of distant relatives, surname inference, age, etc.
  • unanticipated revelation of individuals’ blood relatives, which can lead to dignity loss: consanguineal ties may be revealed that are embarrassing or incriminating, resulting in psychological or reputational harm.

The technical solution gaps are:

  • sequencer manufacturers do not provide software bills of materials (SBOMs) for their devices. Therefore, security professionals have no visibility into potential vulnerabilities in the device software and cannot adequately advise users of sequencers on how to address discovered software vulnerabilities through patching or other mitigation measures.
  • sequencers are typically connected to a network and the internet. This provides access to the manufacturer for updates and transfer of files to secure storage. There is no guidance on the network addresses and the corresponding network protocols that are required for sequencers to effectively operate.
  • the problem of data confinement (i.e., authorized users and/or their software sharing unauthorized access to data) is a well-known unsolved problem in cybersecurity. Due to the privacy risks to subjects as well as the high value of many types of genomic data, the confinement problem is of relevance to genomic data. This problem is most commonly addressed by contractual controls that can be particularly complex when the controls are between multiple organizations. Contractual controls typically do not prevent unauthorized data sharing, but provide penalties if it is done. These penalties typically cannot redress the privacy loss of patients and data subjects.
  • most genetic data sharing and processing occurs in cloud environments, frequently leveraging containers (e.g., Docker or Pods). Many cybersecurity vulnerability scanners are not optimized for scanning containers, resulting in an inability to identify certain vulnerabilities and a high number of false positives.
  • in patient healthcare, the nature and size of genomic data present a challenge to incorporating it into workflows.

The privacy considerations NIST raises include:

  • human subjects who provide their genomic data for research expect that their privacy will be protected by all organizations involved in the research process. Human genomic data can reveal a great deal of personal information, such as physical traits, predisposition to certain health conditions, and biological relationships.
  • individual privacy may be impacted when an individual is identified, and other protected or sensitive information is uncovered or made available.
  • the genetic data itself can serve as an identifier and provide additional information about the contributor.
  • aggregate data in large genomic databases can lead to the identification of individuals or their biological relatives.
  • deidentifying genomic data is impossible without destroying some or all of the utility of that data.
  • human subjects provide their data under informed consent, which may further limit the processing of their data to specific uses.
  • genomic data collected with differing informed consents is often aggregated, but when this is done, care must be taken that all data is still used within the boundaries specified in the informed consent of each data subject. Organizations that process genomic data need to be able to effectively communicate internally and externally about managing these and other privacy risks.
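
The point about aggregating data collected under differing consents can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the NIST report: the `GenomicRecord` type, the `permitted_uses` field, the use labels, and the subject IDs are all hypothetical. It assumes each record carries the set of uses its subject consented to, and it filters an aggregated dataset down to the records whose consent covers the intended use before any analysis runs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicRecord:
    """Hypothetical record: a subject ID plus the uses named in that subject's consent."""
    subject_id: str
    permitted_uses: frozenset[str]


def records_permitted_for(records: list[GenomicRecord], intended_use: str) -> list[GenomicRecord]:
    """Return only the records whose informed consent covers the intended use."""
    return [r for r in records if intended_use in r.permitted_uses]


# Hypothetical aggregated dataset built from studies with differing consents.
aggregated = [
    GenomicRecord("S1", frozenset({"cancer_research", "general_research"})),
    GenomicRecord("S2", frozenset({"cancer_research"})),
    GenomicRecord("S3", frozenset({"general_research"})),
]

usable = records_permitted_for(aggregated, "cancer_research")
print([r.subject_id for r in usable])  # prints ['S1', 'S2']
```

Carrying the consent scope with each record, rather than with the dataset as a whole, is one way to keep the boundary checkable after aggregation. Real consent terms are usually richer than a flat set of labels.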

The NIST Privacy Framework can help organizations prioritize the policies and technical capabilities they need to manage the privacy risks that may arise from data processing, including processing genomic data.

Solutions for addressing subject data privacy and preventing the breach of data confinement include:

  • a genomic analysis platform brings researchers to a secure location, requiring the use of plaintext and blocking all egress.
  • a “Secret Store” requires the use of predefined executables with no direct visibility to the data.
  • all data is stored so that analysis is done in a cloud environment created by a trusted authority, and all activity in that environment is monitored. It is not a complete solution, as researchers can provide visibility and/or their authorization to others.
  • federated machine learning, where collaborative learning is done among individual local nodes and the results are shared with a centralized processor to produce an aggregate model.
  • differential privacy can provide privacy guarantees and can be used to produce synthetic data for machine learning. It adds noise to datasets so that an individual’s data cannot be distinguished within the dataset, which allows the data to be analyzed and shared.
  • privacy-enhancing cryptography addresses both data confinement and subject data privacy. Fully homomorphic encryption allows computation on encrypted data, and progress has been rapid at addressing its principal drawback of increased computational complexity.
  • secure multi-party computation, where analysis is performed over the data sets of several parties without revealing their inputs.
  • there may be a place for zero-knowledge proofs for validating results of genomic analysis without revealing the solution, which may fully preserve the privacy of the data underpinning the results.
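
To give a feel for the differential privacy idea in the list above, here is a minimal sketch of the Laplace mechanism applied to a single count query. It is not from the NIST report. Note that it adds noise to a query result rather than to the dataset itself, which is one common way differential privacy is applied. The cohort values, the `dp_carrier_count` function name, and the choice of `epsilon` are all hypothetical. Python's `random` module is used only for illustration and is not suitable for a real deployment.

```python
from __future__ import annotations

import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on zero with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_carrier_count(genotypes: list[int], epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of variant carriers (genotype > 0)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for g in genotypes if g > 0)
    # Adding or removing one person changes the count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical cohort: number of copies of a variant allele per person (0, 1, or 2).
cohort = [0, 1, 0, 2, 1, 0, 0, 1]
rng = random.Random(42)  # seeded only so the illustration is repeatable
noisy = dp_carrier_count(cohort, epsilon=0.5, rng=rng)
print(f"noisy carrier count: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one gives a more accurate but less private answer. Choosing that trade-off is a policy decision as much as a technical one.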


 

 
