BigQ: a NoSQL based framework to handle genomic variants in i2b2.

Gabetta M, Limongelli I, Rizzo E, Riva A, Segagni D, Bellazzi R - BMC Bioinformatics (2015)

Bottom Line: A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore allows managing the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations.


Affiliation: Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy. matteo.gabetta@unipv.it.

ABSTRACT

Background: Precision medicine requires the tight integration of clinical and molecular data. To this end, it is necessary to define technological solutions able to manage the overwhelming amount of high-throughput genomic data needed to test associations between genomic signatures and human phenotypes. The i2b2 Center (Informatics for Integrating Biology and the Bedside) has developed an internationally adopted framework for using existing clinical data in discovery research, which can help define precision medicine interventions when coupled with genetic data. i2b2 can be significantly advanced by designing efficient solutions for managing Next Generation Sequencing data.

Results: We developed BigQ, an extension of the i2b2 framework, which integrates patient clinical phenotypes with genomic variant profiles generated by Next Generation Sequencing. A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. We report an evaluation of the query performance of our system on more than 11 million variants, showing that the implemented solution scales linearly in terms of query time and disk space with the number of variants.

Conclusions: In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore allows managing the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations.


Fig. 2: Example of annotated variants in JSON format. Two variants, represented in VCF format, are identified by standard attributes (reference genome, chromosome, variant position and nucleotide changes). The annotation step finds that one variant falls in an exon and causes an amino acid substitution at the protein level (non-synonymous), while the other is located in a transcript splicing site. The two variants generate two different JSON objects, characterized by different attributes. Differences between the JSON objects are highlighted in bold.

Mentions: The output of ANNOVAR is then used to create one JSON document for each variant belonging to a single patient. JSON is an open standard format used to transmit data objects consisting of attribute-value pairs. In our case, the set of possible attributes consists of data coming from the original VCF file and of the functional annotations added by ANNOVAR (see Additional file 1: Table S1). Finally, each document is associated with a universally unique identifier (UUID) and sent to CouchDB. Notably, different types of variants may be represented by different data structures. For example, a deep intergenic variant may not have any gene associated with it, and a synonymous coding variant does not hold prediction scores such as PolyPhen-2 and SIFT (which apply only to non-synonymous variants). We have therefore developed an extensible object-oriented data structure, written in Java, able to model different kinds of variants and annotations. One may want to add a new genomic track, or may not be interested in using others. Therefore, each variant type is modeled and treated individually, and holds only the needed attributes (see Fig. 2).
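
The idea of one JSON document per variant, with each variant type carrying only its own attributes, can be illustrated with a minimal sketch. This is not the authors' Java data structure: it is a Python illustration, and every class name, field name and sample value in it is a hypothetical placeholder chosen for the example. The only conventions taken from outside the sketch are that a UUID identifies each document and that CouchDB stores a document's identifier in its `_id` field.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Variant:
    """Attributes taken from the VCF record (field names are illustrative)."""
    patient_id: str
    reference_genome: str
    chromosome: str
    position: int
    ref: str
    alt: str

    def to_document(self) -> dict:
        # Keep only the attributes that are actually set, so each variant
        # type yields a JSON object with its own set of keys.
        doc = {k: v for k, v in asdict(self).items() if v is not None}
        doc["variant_type"] = type(self).__name__
        # CouchDB keeps the document identifier in the "_id" field.
        doc["_id"] = str(uuid.uuid4())
        return doc


@dataclass
class NonSynonymousVariant(Variant):
    """Exonic variant with an amino acid change; may carry prediction scores."""
    gene: Optional[str] = None
    aa_change: Optional[str] = None
    sift_score: Optional[float] = None
    polyphen2_score: Optional[float] = None


@dataclass
class SplicingVariant(Variant):
    """Variant in a splicing site; no protein-level prediction scores."""
    gene: Optional[str] = None


@dataclass
class IntergenicVariant(Variant):
    """Deep intergenic variant; no gene associated with it."""


if __name__ == "__main__":
    # All values below are made up for illustration only.
    variants = [
        NonSynonymousVariant("patient-1", "hg19", "1", 12345, "A", "G",
                             gene="GENE1", aa_change="p.X1Y", sift_score=0.01),
        SplicingVariant("patient-1", "hg19", "2", 67890, "C", "T", gene="GENE2"),
    ]
    for variant in variants:
        print(json.dumps(variant.to_document(), indent=2))
```

Because unset attributes are dropped rather than written as nulls, the two printed documents differ in their keys, which is the situation Fig. 2 describes. Adding a new annotation track in this sketch would amount to adding an optional field to the relevant subclass.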

