Matching of array CGH and gene expression microarray features for the purpose of integrative genomic analyses.

van Wieringen WN, Unger K, Leday GG, Krijgsman O, de Menezes RX, Ylstra B, van de Wiel MA - BMC Bioinformatics (2012)

Bottom Line: Although the matching of platforms is important, many integrative analyses do not describe it, or describe it insufficiently. Illustration of the matching procedures on a variety of data sets reveals how the procedures differ in their use of the available data, and shows that they may even lead to different results for individual genes. Matching of data from multiple genomics platforms is an important preprocessing step for many integrative bioinformatic analyses, for which we present six generic procedures, both old and new.


Affiliation: Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands. w.vanwieringen@vumc.nl

ABSTRACT

Background: An increasing number of published genomic studies interrogate more than one molecular level. Bioinformatics follows biological practice, and recent years have seen a surge in methodology for the integrative analysis of genomic data. Often such analyses require knowledge of which elements of one platform link to those of another. Although this matching is important, many integrative analyses do not describe it, or describe it insufficiently.

Results: We describe, illustrate and discuss six matching procedures. They are implemented in the R package sigaR (available from Bioconductor). The principles underlying the presented matching procedures are generic, and can be combined to form new matching approaches or be applied to the matching of other platforms. Illustration of the matching procedures on a variety of data sets reveals how the procedures differ in their use of the available data, and shows that they may even lead to different results for individual genes.

Conclusions: Matching of data from multiple genomics platforms is an important preprocessing step for many integrative bioinformatic analyses, for which we present six generic procedures, both old and new. They have been implemented in the R package sigaR, available from Bioconductor.


Figure 9: Flowchart of cis-effect analysis.

Mentions: Finally, we illustrate the effect of matching on downstream analysis. We assess the cis-effect of a DNA copy number aberration on the expression levels of the genes mapping to it. To this end, we employ piecewise linear regression splines (PLRS), which allow any plausible type of relationship between the two molecular levels [Leday GGR, Van der Vaart AW, Van Wieringen WN, Van de Wiel MA (2012) “Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines”, submitted; 31]. The association between a gene’s dosage and its expression levels is declared significant if its corrected p-value (Benjamini-Hochberg multiple testing correction) is smaller than 0.05. The associated workflow is portrayed in Figure 9. Table 10 reports the number of significant genes for each combination of data set and matching procedure (the Taylor I and II data sets are excluded as uninformative, since neither yielded anything significant). This number is reported for the whole set of genes matched by each procedure (the size of this set can be found in Table 7), and also for the restricted set containing only those genes that are matched by all procedures. The Taylor I and Taylor II data sets therefore do not discriminate between the matching procedures. For the Chin data set the distance procedure finds the most significant genes, followed by distanceAny (< 100 k), overlapPlus and the other overlap methods. This order is concordant with the matching result: the more matched genes, the more discoveries. This may obscure the comparison of the methods. Moreover, as pointed out before, the distance and distanceAny (with a large window size) procedures may match genes to DNA copy number features located elsewhere on the genome. This raises doubts about the interpretation of significant associations. In the restricted set of genes, the number of discoveries is the same across the methods, except that the overlapAny procedure yields one additional finding.
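The concern that a distance-based procedure may pair a gene with a remote copy number feature can be made concrete with a small Python sketch. This is not the sigaR implementation: the `Interval` type, the function names `match_by_overlap` and `match_by_distance`, the midpoint-based distance and the `window` argument are all assumptions made for illustration only.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    """A genomic interval; used here for both genes and copy number features."""
    chrom: str
    start: int  # first base pair covered
    end: int    # last base pair covered


def midpoint(iv: Interval) -> float:
    return (iv.start + iv.end) / 2


def match_by_overlap(gene: Interval, features: List[Interval]) -> List[Interval]:
    """Hypothetical overlap matching: keep features that intersect the gene."""
    return [f for f in features
            if f.chrom == gene.chrom and f.start <= gene.end and f.end >= gene.start]


def match_by_distance(gene: Interval, features: List[Interval],
                      window: Optional[int] = None) -> Optional[Interval]:
    """Hypothetical distance matching: nearest same-chromosome feature by midpoint.

    If `window` is given, a nearest feature farther away than `window` is rejected.
    """
    same_chrom = [f for f in features if f.chrom == gene.chrom]
    if not same_chrom:
        return None
    nearest = min(same_chrom, key=lambda f: abs(midpoint(f) - midpoint(gene)))
    if window is not None and abs(midpoint(nearest) - midpoint(gene)) > window:
        return None
    return nearest


# A gene with no feature nearby: the closest feature lies roughly 0.9 Mb away.
gene = Interval("chr1", 1_000_000, 1_020_000)
features = [Interval("chr1", 100_000, 100_060),
            Interval("chr1", 5_000_000, 5_000_060)]

overlap_hits = match_by_overlap(gene, features)            # no overlapping feature
unbounded_hit = match_by_distance(gene, features)          # still returns a feature
windowed_hit = match_by_distance(gene, features, window=100_000)  # rejected
```

In this toy example the unbounded distance rule still assigns the gene a feature almost 1 Mb away, whereas the overlap rule and the windowed distance rule leave it unmatched. Under these assumptions, that is the trade-off between the number of matched genes and the interpretability of the resulting associations.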
In the TCGA I and TCGA II data sets, irrespective of the gene set considered, the overlapAny procedure yields the most significant findings. It is noteworthy that, even though the distance-based procedures match hundreds (TCGA I) or thousands (TCGA II) more genes, this does not lead to more discoveries. This could be interpreted as the additionally matched genes being assigned an unrelated DNA copy number signature. In summary, this comparison of downstream analyses suggests that (at least in data sets generated on a high-resolution DNA copy number platform) the overlapAny procedure may be preferred.
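The significance rule used in this comparison (Benjamini-Hochberg corrected p-value below 0.05) can be sketched in a few lines of Python. This is a minimal illustration of the standard step-up adjustment, not the code used by the authors; the function names and the toy p-values are assumptions introduced here.

```python
from typing import Dict, List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(pvalue_by_gene: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Genes whose adjusted p-value is strictly smaller than alpha."""
    genes = list(pvalue_by_gene)
    adjusted = benjamini_hochberg([pvalue_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < alpha]


# Toy p-values for four hypothetical genes (not taken from the paper).
toy_pvalues = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.20}
toy_adjusted = benjamini_hochberg(list(toy_pvalues.values()))
toy_hits = significant_genes(toy_pvalues)
```

Because the adjustment depends on the number of tests, the count of significant genes is tied to how many genes a matching procedure retains, which is one reason the comparison is also reported on the restricted set of genes matched by all procedures.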

