CorrelaGenes: a new tool for the interpretation of the human transcriptome.

Cremaschi P, Rovida S, Sacchi L, Lisa A, Calvi F, Montecucco A, Biamonti G, Bione S, Sacchi G - BMC Bioinformatics (2014)

Bottom Line: The significance of the correlation is measured by coupling the Lift, a well-known standard ARM index, with the χ² p value. The manually curated selection of the comparisons and the developed algorithm constitute a new approach in the field of gene expression profiling studies. The preliminary results of the simulation showed how CorrelaGenes could contribute to the characterization of molecular pathways and biological processes, integrating data obtained from other applications and available in public repositories.


ABSTRACT

Background: The amount of gene expression data available in public repositories has grown exponentially in recent years, creating a need for new data mining tools that transform these data into information easily accessible to biologists.

Results: By exploiting expression data publicly available in the Gene Expression Omnibus (GEO) database, we developed a new bioinformatics tool aimed at identifying genes whose expression appears simultaneously altered in different experimental conditions, thus suggesting co-regulation or coordinated action in the same biological process. To accomplish this task, we used the 978 human GEO Curated DataSets and manually selected 2,109 pairwise comparisons based on their biological rationale. The lists of differentially expressed genes obtained from the selected comparisons were stored in a PostgreSQL database and used as the data source for the CorrelaGenes tool. Our application uses a customized Association Rule Mining (ARM) algorithm to identify sets of genes showing expression profiles correlated with a gene of interest. The significance of the correlation is measured by coupling the Lift, a well-known standard ARM index, with the χ² p value. The manually curated selection of the comparisons and the developed algorithm constitute a new approach in the field of gene expression profiling studies. A simulation performed on 100 randomly selected target genes allowed us to evaluate the efficiency of the procedure and to obtain preliminary data demonstrating the consistency of the results.

Conclusions: The preliminary results of the simulation showed how CorrelaGenes could contribute to the characterization of molecular pathways and biological processes, integrating data obtained from other applications and available in public repositories.
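
The abstract names two measures, the Lift and the χ² p value, without giving formulas. The following is a minimal sketch of how both can be computed for one target gene and one candidate gene, assuming the textbook ARM definition of Lift and an uncorrected Pearson χ² test on a 2x2 table of comparison counts. The function name, its parameters and the counts in the usage lines are illustrative assumptions, not taken from the CorrelaGenes implementation, whose customized algorithm may differ.

```python
from math import erfc, sqrt


def lift_and_chi2(n_total, n_target, n_candidate, n_both):
    """Return (lift, chi2, p_value) for two genes across a set of comparisons.

    n_total     -- number of comparisons considered
    n_target    -- comparisons in which the target gene is differentially expressed
    n_candidate -- comparisons in which the candidate gene is differentially expressed
    n_both      -- comparisons in which both genes are differentially expressed

    Hypothetical helper: standard definitions, not the tool's own code.
    """
    # Cells of the 2x2 contingency table (target yes/no x candidate yes/no).
    a = n_both
    b = n_target - n_both
    c = n_candidate - n_both
    d = n_total - n_target - n_candidate + n_both
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")

    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("degenerate table: a margin is zero")

    # Lift = P(target and candidate) / (P(target) * P(candidate)).
    lift = (a * n_total) / (row1 * col1)

    # Pearson chi-square statistic for a 2x2 table, no continuity correction.
    chi2 = n_total * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return lift, chi2, p_value


# Made-up counts in which the two genes are exactly independent.
print(lift_and_chi2(100, 20, 50, 10))  # (1.0, 0.0, 1.0)
```

A Lift above 1 indicates that the two genes are altered together more often than expected under independence, while the p value indicates whether that departure is statistically supported; this is presumably why the two are used together.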



Figure 2: Schematic representation of GDS2516 processing. (A) Design of GDS2516 and experimental factor definition (F1 to F7). (B) Contrast matrix created with all group-versus-group comparisons. The 21 pairwise comparisons are shown in light blue. (C) Comparisons manually selected by the experts. The 5 selected comparisons are shown in dark blue.

Mentions: The source of the expression data used by the CorrelaGenes tool was built from 978 human GEO Curated DataSets (GDS) downloaded from the GEO archive through the GEOquery 2.21.9 R/Bioconductor package [10]. The experimental design of each GDS was used to group the intensity values measured for individual samples sharing the same experimental factors (Figure 2A). Groups containing fewer than two samples were not suitable for subsequent analysis, which led to a total of 261 GDS being discarded. The resulting groups were used to create a contrast matrix including all group-versus-group comparisons (Figure 2B). A manually curated, knowledge-based procedure was applied to select appropriate comparisons: because automatically generated matrices often include contrasts without a clear experimental meaning, a team of biologists defined a set of rules to extract those comparisons showing a strong biological rationale (Figure 2C). Figure 2 shows the procedure applied to GDS2516 (see also Additional File 1 for a more detailed description of the whole procedure). In this example, a total of seven experimental factors were identified and used to create an all-groups-versus-group contrast matrix of 21 comparisons, of which only five were selected by the experts. This procedure led to the selection of 2,109 pairwise comparisons in 717 GDS. In 1,876 of the 2,109 comparisons a "control" experimental factor was detectable, which made it possible to define the sign of the altered expression measure (i.e. genes up- or down-regulated).
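
The count of 21 comparisons follows from taking every unordered pair among the seven experimental factors, 7 × 6 / 2 = 21. The sketch below enumerates such a contrast set and then keeps only the pairs on a curated list. The function names are hypothetical, and the curated pairs in the usage lines are placeholders: the five comparisons actually selected for GDS2516 are not listed in this text.

```python
from itertools import combinations


def all_pairwise_contrasts(groups):
    """Return every unordered pair of groups as a list of 2-tuples."""
    return list(combinations(groups, 2))


def select_contrasts(contrasts, curated_pairs):
    """Keep only contrasts that appear in the curated list.

    Pair order is ignored, so ("F3", "F1") matches ("F1", "F3").
    """
    curated = {frozenset(pair) for pair in curated_pairs}
    return [c for c in contrasts if frozenset(c) in curated]


# Seven experimental factors, labeled F1 to F7 as in Figure 2A.
factors = [f"F{i}" for i in range(1, 8)]
contrasts = all_pairwise_contrasts(factors)
print(len(contrasts))  # 21

# Placeholder curated list, NOT the experts' actual selection.
placeholder_selection = [("F1", "F2"), ("F3", "F1")]
print(select_contrasts(contrasts, placeholder_selection))
```

In the procedure described above, the curated list comes from rules defined by biologists rather than from code, so the filtering step here only stands in for that manual selection.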

