BigQ: a NoSQL based framework to handle genomic variants in i2b2.

Gabetta M, Limongelli I, Rizzo E, Riva A, Segagni D, Bellazzi R - BMC Bioinformatics (2015)

Bottom Line: A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system therefore allows managing the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations.


Affiliation: Dipartimento di Ingegneria Industriale e dell'Informazione and Center for Health Technologies, Università di Pavia, Pavia, Italy. matteo.gabetta@unipv.it.

ABSTRACT

Background: Precision medicine requires the tight integration of clinical and molecular data. To this end, technological solutions are needed that can manage the overwhelming amount of high-throughput genomic data required to test associations between genomic signatures and human phenotypes. The i2b2 Center (Informatics for Integrating Biology and the Bedside) has developed an internationally adopted framework that uses existing clinical data for discovery research and, when coupled with genetic data, can help define precision medicine interventions. i2b2 can be significantly advanced by designing efficient management solutions for Next Generation Sequencing data.

Results: We developed BigQ, an extension of the i2b2 framework, which integrates patient clinical phenotypes with genomic variant profiles generated by Next Generation Sequencing. A visual programming i2b2 plugin allows retrieving variants belonging to the patients in a cohort by applying filters on genomic variant annotations. We report an evaluation of the query performance of our system on more than 11 million variants, showing that the implemented solution scales linearly in terms of query time and disk space with the number of variants.

Conclusions: In this paper we describe a new i2b2 web service composed of an efficient and scalable document-based database that manages annotations of genomic variants and of a visual programming plug-in designed to dynamically perform queries on clinical and genetic data. The system can therefore manage the fast-growing volume of genomic variants and can be used to integrate heterogeneous genomic annotations.

Fig. 3: The simplified binning scheme and search strategy implemented in CouchDB. For genomic feature A, the smallest containing bin is the one reached by navigating the tree along the path 0,0,1,0,0 (in red). For feature B and feature U, the smallest containing bins are (0,1) and (0,0,1,1,1), respectively. Given the interval query Q, its smallest containing bin is the one coded by (0,0,1). When searching for genomic features within the corresponding overlapping bins, in both the lower and the upper part of the tree, genomic feature U would also be reported: it falls in one of the searched bins even though it does not overlap Q. Therefore, two more queries (views) are performed to remove the non-overlapping elements: the first adds the start position of the genomic feature to the view keys (patient id, chromosome, bin, start), while the second adds the stop position. In this example, genomic feature U would be removed from the query result set because its start position is greater than the end position of Q.
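The search strategy in the caption can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the CouchDB views used by BigQ. It assumes that the "searched bins" are the ancestors of Q's bin (bin codes that are a prefix of Q's code) plus the subtree below it (bin codes that start with Q's code), and that coordinates are inclusive. The bin codes come from the caption; the start and end coordinates, the dictionary layout, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

Feature = Dict[str, object]


def is_prefix(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if bin code `a` is a prefix of bin code `b`."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def candidate_features(features: List[Feature], q_code: Sequence[int]) -> List[Feature]:
    """Step 1: keep features stored in an ancestor bin of Q's bin,
    in Q's bin itself, or in any bin of the subtree below it."""
    return [
        f for f in features
        if is_prefix(f["code"], q_code) or is_prefix(q_code, f["code"])
    ]


def overlapping_features(features: List[Feature], q_code: Sequence[int],
                         q_start: int, q_end: int) -> List[Feature]:
    """Step 2: drop candidates that start after Q ends or end before Q starts
    (the role played by the two extra views keyed on start and stop)."""
    return [
        f for f in candidate_features(features, q_code)
        if f["start"] <= q_end and f["end"] >= q_start
    ]


# Bin codes from the caption; coordinates are made up for illustration.
features = [
    {"name": "A", "code": [0, 0, 1, 0, 0], "start": 9, "end": 9},
    {"name": "B", "code": [0, 1], "start": 18, "end": 30},
    {"name": "U", "code": [0, 0, 1, 1, 1], "start": 14, "end": 15},
]
q_code, q_start, q_end = [0, 0, 1], 9, 13

candidates = [f["name"] for f in candidate_features(features, q_code)]
hits = [f["name"] for f in overlapping_features(features, q_code, q_start, q_end)]
print(candidates, hits)
```

With these illustrative coordinates, the bin-only search reports both A and U, and the start/stop filter then removes U because it starts after Q ends.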

Mentions: Each chromosome has been divided into a predefined set of hierarchical bins (called a “tree”) whose size depends on the length of the specific chromosome. The value 0 or 1 has been assigned to each bin, depending on whether it is the left or the right child bin within the tree, respectively (with the exception of the root bin, which is encoded by 0). A code is then assigned to each genomic feature (i.e. each variant), corresponding to the ordered series of 0s and 1s obtained by navigating the binning tree from the root to the smallest bin that entirely contains the variant (see Fig. 3). The bin code is then represented as a JSON attribute whose value is an array of 0s and 1s.
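The encoding described above can be illustrated with a short Python sketch. It assumes that each bin is split into two equal halves, that the tree has a fixed maximum depth, and that feature coordinates are inclusive; the text does not specify the actual bin sizes or depth used by BigQ, so `chrom_length`, `max_depth`, and the function name `bin_code` are hypothetical.

```python
import json
from typing import List


def bin_code(start: int, end: int, chrom_length: int, max_depth: int = 16) -> List[int]:
    """Return the 0/1 path from the root to the smallest bin that
    entirely contains the inclusive interval [start, end]."""
    if not (0 <= start <= end < chrom_length):
        raise ValueError("feature must lie within the chromosome")
    code = [0]                # the root bin is encoded by 0
    lo, hi = 0, chrom_length  # current bin covers positions lo .. hi-1
    for _ in range(max_depth):
        if hi - lo < 2:       # a single-position bin cannot be split
            break
        mid = (lo + hi) // 2
        if end < mid:         # feature fits in the left child
            code.append(0)
            hi = mid
        elif start >= mid:    # feature fits in the right child
            code.append(1)
            lo = mid
        else:                 # feature straddles both children: stop here
            break
    return code


# Toy chromosome of 32 positions and a tree with 4 levels below the root.
doc = {"start": 9, "end": 9, "bin": bin_code(9, 9, chrom_length=32, max_depth=4)}
print(json.dumps(doc))
```

On this toy chromosome, a feature at position 9 receives the code 0,0,1,0,0, which is then stored in the document as a JSON array.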

