Standardized metadata for human pathogen/vector genomic sequences.

Dugan VG, Emrich SJ, Giraldo-Calderón GI, Harb OS, Newman RM, Pickett BE, Schriml LM, Stockwell TB, Stoeckert CJ, Sullivan DE, Singh I, Ward DV, Yao A, Zheng J, Barrett T, Birren B, Brinkac L, Bruno VM, Caler E, Chapman S, Collins FH, Cuomo CA, Di Francesco V, Durkin S, Eppinger M, Feldgarden M, Fraser C, Fricke WF, Giovanni M, Henn MR, Hine E, Hotopp JD, Karsch-Mizrachi I, Kissinger JC, Lee EM, Mathur P, Mongodin EF, Murphy CI, Myers G, Neafsey DE, Nelson KE, Nierman WC, Puzak J, Rasko D, Roos DS, Sadzewicz L, Silva JC, Sobral B, Squires RB, Stevens RL, Tallon L, Tettelin H, Wentworth D, White O, Will R, Wortman J, Zhang Y, Scheuermann RH - PLoS ONE (2014)

Bottom Line: The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.


Affiliation: J. Craig Venter Institute, Rockville, Maryland, and La Jolla, California, United States of America; National Institute of Allergy and Infectious Diseases, Rockville, Maryland, United States of America.

ABSTRACT
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium's minimal information (MIxS) and NCBI's BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. 
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.

pone-0099979-g002: Semantic Network of the Core Project Data Fields. A semantic representation of the entities relevant to describe infectious disease projects based on the OBI and other OBO Foundry ontologies is shown. Distinctions are made between material entities (blue outlines), information entities and qualities (black outlines), and processes (red outlines). Entities are connected by standard semantic relations, in italic. The subset of entities selected as Core Project fields are noted with ovals containing the respective Field ID. For example, both the “Project Title” (CP1) and “Project ID” (CP2) denote an OBI:Investigation; the “Project Description” (CP3) is_about the same OBI:Investigation.
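
The example in this caption can be pictured as a handful of subject-relation-object triples. The following is only a minimal sketch under stated assumptions, not the representation used in the paper: the field IDs (CP1 to CP3), the relation names "denotes" and "is_about", and the class OBI:Investigation come from the caption, while the instance name `investigation_1`, the `instance_of` relation, and the helper `fields_linked_to` are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict

# Hypothetical instance standing in for one concrete project (assumption).
INVESTIGATION = "investigation_1"

# Each triple is (subject, relation, object). Only CP1-CP3, "denotes",
# "is_about", and OBI:Investigation are taken from the figure caption;
# "instance_of" is an illustrative assumption.
triples = [
    (INVESTIGATION, "instance_of", "OBI:Investigation"),
    ("CP1 Project Title", "denotes", INVESTIGATION),
    ("CP2 Project ID", "denotes", INVESTIGATION),
    ("CP3 Project Description", "is_about", INVESTIGATION),
]


def fields_linked_to(entity, triples):
    """Group the subjects that point at `entity` by relation name."""
    by_relation = defaultdict(list)
    for subject, relation, obj in triples:
        if obj == entity:
            by_relation[relation].append(subject)
    return dict(by_relation)


linked = fields_linked_to(INVESTIGATION, triples)
for relation, fields in linked.items():
    print(relation, "<-", ", ".join(fields))
```

Keeping the relation explicit in each triple is what lets a reader tell that the title and ID both name the investigation, whereas the description is a statement about it.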

Mentions: After the list of terms to be included in the Core Project, Core Sample, and Sequencing Assay data fields was compiled, the terms were assembled into a semantic network based on OBI and other OBO Foundry-compatible ontologies and relations (Figures 2 and 3). The goal of this process was to define the relationships between the various data fields and to identify any gaps or inconsistencies in the original list. For example, the first draft of Core Sample terms included only one temporal data field for any given sequenced specimen; however, it quickly became apparent that one time point was insufficient, since a time measurement may be assigned when an infectious agent was first collected from the specimen source organism, when the health status of the specimen source was assessed, when the sample material was extracted from the specimen, when the sample material was subjected to an experimental assay, etc. (Figure 3). Indeed, all processes occur within their own timespan, and their temporal relationships can have important implications for the interpretation of the resulting sequence record. Although these temporal (and other) relationships are implied, constructing a formal semantic representation allowed us to correct similar omissions and clarify meanings that had previously been unintentionally ambiguous. In the case of Core Project, similar data fields were identified for each of the main parties involved: sample providers, assay centers, and bioinformatics centers (Figure 2).
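
To make the point about multiple time points concrete, the sketch below attaches a separate date to each process instead of a single date to the specimen. It is an illustration under assumptions, not the standard's actual schema: the process labels paraphrase the four events named in the paragraph, and the class `ProcessEvent`, the function `days_between`, and all dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProcessEvent:
    """One process in a specimen's history, with its own time point."""
    process: str
    occurred_on: date


# Made-up dates for illustration only; they are not from any real record.
events = [
    ProcessEvent("specimen_collection", date(2013, 3, 1)),
    ProcessEvent("health_status_assessment", date(2013, 3, 2)),
    ProcessEvent("sample_extraction", date(2013, 3, 10)),
    ProcessEvent("sequencing_assay", date(2013, 4, 1)),
]


def days_between(events, earlier, later):
    """Return the number of days from process `earlier` to process `later`."""
    by_process = {e.process: e.occurred_on for e in events}
    return (by_process[later] - by_process[earlier]).days


elapsed = days_between(events, "specimen_collection", "sequencing_assay")
print(f"collection to assay: {elapsed} days")
```

With a single temporal field, an interval such as the time between collection and assay could not be recovered at all, which is the kind of gap the semantic modeling exposed.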

