Combining transcriptional datasets using the generalized singular value decomposition.

Schreiber AW, Shirley NJ, Burton RA, Fincher GB - BMC Bioinformatics (2008)

Bottom Line: Because of platform-dependent systematic effects, simple comparisons or merging of datasets obtained by these technologies are difficult, even though they may often be desirable. The method enables us to discover putative candidate genes involved in the biosynthesis of the (1,3;1,4)-beta-D-glucan polysaccharide found in plant cell walls. We believe that it will prove to be particularly useful for exploiting large, publicly available, microarray datasets for species with unsequenced genomes by complementing them with more limited in-house expression measurements.


Affiliation: Australian Centre for Plant Functional Genomics, School of Agriculture and Wine, University of Adelaide, Waite Campus, Glen Osmond, SA 5064, Australia. andreas.schreiber@adelaide.edu.au

ABSTRACT

Background: Both microarrays and quantitative real-time PCR are convenient tools for studying the transcriptional levels of genes. The former is preferable for large-scale studies, while the latter is a more targeted technique. Because of platform-dependent systematic effects, simple comparisons or merging of datasets obtained by these technologies are difficult, even though they may often be desirable. These difficulties are exacerbated if there is only partial overlap between the experimental conditions and genes probed in the two datasets.

Results: We show here that the generalized singular value decomposition provides a practical tool for merging a small, targeted dataset obtained by quantitative real-time PCR of specific genes with a much larger microarray dataset. The technique permits, for the first time, the identification of genes present in only one dataset co-expressed with a target gene present exclusively in the other dataset, even when experimental conditions for the two datasets are not identical. With the rapidly increasing number of publicly available large-scale microarray datasets, the latter is frequently the case. The method enables us to discover putative candidate genes involved in the biosynthesis of the (1,3;1,4)-beta-D-glucan polysaccharide found in plant cell walls.

Conclusion: We show that the generalized singular value decomposition provides a viable tool for a combined analysis of two gene expression datasets with only partial overlap of both gene sets and experimental conditions. We illustrate how the decomposition can be optimized self-consistently by using a judicious choice of genes to define it. The ability of the technique to seamlessly define a concept of "co-expression" across both datasets provides an avenue for meaningful data integration. We believe that it will prove to be particularly useful for exploiting large, publicly available, microarray datasets for species with unsequenced genomes by complementing them with more limited in-house expression measurements.
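
For readers unfamiliar with the decomposition itself, the following is a minimal NumPy sketch of a generalized singular value decomposition of two matrices. It is not the authors' implementation. It assumes that the two matrices have the same number of columns, that each has at least as many rows as columns, and that the second matrix has full column rank; which dimension the two expression datasets share in the paper is not stated in this excerpt. The function name `gsvd_sketch` and the random test matrices are hypothetical.

```python
import numpy as np

def gsvd_sketch(A, B):
    """Return U, c, V, s, Xt with A = U diag(c) Xt and B = V diag(s) Xt.

    Assumptions (for illustration only): A and B have the same number of
    columns n, each has at least n rows, and B has full column rank.
    """
    m1 = A.shape[0]
    # Reduced QR of the stacked matrix; Q has orthonormal columns.
    Q, R = np.linalg.qr(np.vstack([A, B]))
    Q1, Q2 = Q[:m1], Q[m1:]
    # SVD of the top block: Q1 = U diag(c) Wt.
    U, c, Wt = np.linalg.svd(Q1, full_matrices=False)
    # Because Q1^T Q1 + Q2^T Q2 = I, the columns of Q2 W are mutually
    # orthogonal with norms s = sqrt(1 - c^2).
    Z = Q2 @ Wt.T
    s = np.linalg.norm(Z, axis=0)
    V = Z / s
    # Shared right-hand factor of both decompositions.
    Xt = Wt @ R
    return U, c, V, s, Xt

# Synthetic example: a large and a small matrix sharing 6 columns.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 6))
B = rng.normal(size=(8, 6))
U, c, V, s, Xt = gsvd_sketch(A, B)
```

The ratios `c / s` indicate how strongly each shared pattern (row of `Xt`) is weighted in the first matrix relative to the second; patterns with comparable weight in both are the natural candidates for a "central" subspace.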



Figure 5: Relation between co-expression within the central arraylets and co-expression in the microarray data. Panel A shows gene expression as measured in the arraylets defined by the GSVD using a set of genes described in the main text (green – low expression, red – high expression). Only the 200 transcripts whose expression profile in the central arraylets 5–7 (boxed) is closest to that of HvCslF3 (as measured by Euclidean distance) are shown. The expression profiles for the same genes, in the space spanned by arrays, are shown in panel B. Approximate co-expression can be seen for some tissues (e.g. expression in anther, caryopsis and endosperm tends to be low, while expression in root, radicle and coleoptile tends to be higher). At the top of panel B we indicate, for each tissue, the standard deviation of expression values among the genes shown on the plot, scaled by the corresponding quantity for the whole dataset; i.e. values larger (smaller) than one indicate larger (smaller) variability than average. As expected (see text), variability in expression is smallest in those tissues represented in both datasets.

Mentions: It is instructive to compare this co-expression in the space spanned by arraylets with the expression profiles obtained directly from the microarray. In Fig. 5A we show a heatmap of 200 transcript abundances obtained with the microarray, ordered so that those co-expressing most closely with HvCslF3 in the central arraylets are at the top of the plot. The co-expression in the central arraylets is clearly visible. In contrast, little or no co-expression is apparent in the peripheral arraylets, which characterise expression in the non-overlapping parts of the datasets. For comparison, the corresponding expression profiles in the original space spanned by the arrays of the microarray experiment are shown in Fig. 5B. Some overall trends are apparent: expression in anther, caryopsis and endosperm tends to be low for these genes, while expression in root-like tissues and coleoptile tends to be high. More interesting, however, is the variation in expression among these genes. Co-expression in central arraylets should be reflected in stronger co-expression in tissues that are common to the two platforms than in those that are not. As a measure of this variation we have listed, along the top of Fig. 5B, the standard deviation of expression among these 200 genes, scaled by the corresponding quantity for the whole dataset. We see that a selection of genes based on co-expression within central arraylets has resulted in a gene set that is most tightly co-expressed in anther, caryopsis (5 dap), crown, inflorescence, pistil and radicle, but co-expressed less than average in caryopsis (8–10 & 14–16 dap) and mesocotyl. Comparing with Figure 1, we see that the former tissues are mostly those probed by both tissue series, while the latter are among those probed by the microarray alone. It appears therefore that, as expected, the central arraylets of the GSVD are indeed associated with those tissues for which there is some overlap between the two series.
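
The two quantities used in this comparison can be sketched in a few lines of NumPy: the selection of the transcripts closest to a target profile by Euclidean distance in the central arraylets, and the per-tissue standard deviation among the selected transcripts scaled by that of the whole dataset. This is an illustrative sketch, not the authors' code. The function names, the matrix layout (rows are transcripts; columns are arraylets or tissues) and the toy numbers are assumptions, as is the use of the sample standard deviation (`ddof=1`).

```python
import numpy as np

def closest_transcripts(arraylet_expr, target_profile, central, k=200):
    """Indices of the k transcripts nearest to the target in the central arraylets.

    arraylet_expr: (transcripts x arraylets) expression in arraylet space.
    target_profile: expression of the target gene in the same arraylets.
    central: column indices of the central arraylets (e.g. [4, 5, 6] for 5-7).
    """
    diff = arraylet_expr[:, central] - np.asarray(target_profile)[central]
    dist = np.linalg.norm(diff, axis=1)
    # Stable sort so that ties keep their original order.
    return np.argsort(dist, kind="stable")[:k]

def scaled_sd(array_expr, selected):
    """Per-tissue SD among the selected transcripts, divided by the SD over
    all transcripts. Values below one mean tighter co-expression than average."""
    sd_selected = array_expr[selected].std(axis=0, ddof=1)
    sd_all = array_expr.std(axis=0, ddof=1)
    return sd_selected / sd_all

# Toy data: 4 transcripts, 3 arraylets, 2 tissues (hypothetical numbers).
arraylet_expr = np.array([[0.0, 0.0, 0.0],
                          [1.0, 1.0, 1.0],
                          [5.0, 5.0, 5.0],
                          [0.1, 0.0, 0.0]])
array_expr = np.array([[1.0, 2.0],
                       [3.0, 2.0],
                       [1.0, 10.0],
                       [3.0, 2.0]])
target = np.zeros(3)

selected = closest_transcripts(arraylet_expr, target, central=[0, 1], k=2)
ratio = scaled_sd(array_expr, selected)
```

In this toy example the two selected transcripts agree exactly in the second tissue, so the scaled standard deviation there is zero, whereas in the first tissue they vary more than the dataset as a whole and the ratio exceeds one.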

