Audit of National Institutes of Health's Data Integrity Controls for the Sequence Read Archive Data
The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, hosts one of the National Institutes of Health's (NIH's) largest and most diverse datasets, the Sequence Read Archive (SRA). SRA is a broad collection of experimental DNA and RNA sequences that represent genome diversity. In 2019, SRA held 9 million records in two formats. The original format (23 petabytes) is received by NCBI from submitters and is instrument- and experiment-specific; these data are stored on tape. NCBI then transforms these original data into the standard SRA normalized format (12.7 petabytes) for redistribution. Through this SRA normalized database, which is cloud-based and accessed via NCBI servers, researchers can search metadata to locate the sequence reads for further analyses. SRA usage follows International Nucleotide Sequence Database Collaboration principles, which state that data are shared without restriction, that the individual submitting the data must be the owner of the data, and that ownership of the data remains with the submitter even after submission. This audit will concentrate on system integrity controls, including malicious code protection and data input validation, as well as other Federal requirements for normalizing and archiving SRA data. The audit objective will be to determine whether NIH has implemented adequate system integrity controls to ensure the reliability of SRA data.
Announced or Revised | Agency | Title | Component | Report Number(s) | Expected Issue Date (FY) |
---|---|---|---|---|---|
Revised | National Institutes of Health | Audit of National Institutes of Health's Data Integrity Controls for the Sequence Read Archive Data | Office of Audit Services | WA-22-0005 (W-00-22-42043) | 2024 |