Combining transcriptional datasets using the generalized singular value decomposition.

Schreiber AW, Shirley NJ, Burton RA, Fincher GB - BMC Bioinformatics (2008)

Bottom Line: Because of platform-dependent systematic effects, simple comparisons or merging of datasets obtained by these technologies are difficult, even though they may often be desirable. The method enables us to discover putative candidate genes involved in the biosynthesis of the (1,3;1,4)-beta-D-glucan polysaccharide found in plant cell walls. We believe that it will prove to be particularly useful for exploiting large, publicly available, microarray datasets for species with unsequenced genomes by complementing them with more limited in-house expression measurements.


Affiliation: Australian Centre for Plant Functional Genomics, School of Agriculture and Wine, University of Adelaide, Waite Campus, Glen Osmond, SA 5064, Australia. andreas.schreiber@adelaide.edu.au

ABSTRACT

Background: Both microarrays and quantitative real-time PCR are convenient tools for studying the transcriptional levels of genes. The former is preferable for large scale studies while the latter is a more targeted technique. Because of platform-dependent systematic effects, simple comparisons or merging of datasets obtained by these technologies are difficult, even though they may often be desirable. These difficulties are exacerbated if there is only partial overlap between the experimental conditions and genes probed in the two datasets.

Results: We show here that the generalized singular value decomposition provides a practical tool for merging a small, targeted dataset obtained by quantitative real-time PCR of specific genes with a much larger microarray dataset. The technique permits, for the first time, the identification of genes present in only one dataset co-expressed with a target gene present exclusively in the other dataset, even when experimental conditions for the two datasets are not identical. With the rapidly increasing number of publicly available large scale microarray datasets, the latter is frequently the case. The method enables us to discover putative candidate genes involved in the biosynthesis of the (1,3;1,4)-beta-D-glucan polysaccharide found in plant cell walls.

Conclusion: We show that the generalized singular value decomposition provides a viable tool for a combined analysis of two gene expression datasets with only partial overlap of both gene sets and experimental conditions. We illustrate how the decomposition can be optimized self-consistently by using a judicious choice of genes to define it. The ability of the technique to seamlessly define a concept of "co-expression" across both datasets provides an avenue for meaningful data integration. We believe that it will prove to be particularly useful for exploiting large, publicly available, microarray datasets for species with unsequenced genomes by complementing them with more limited in-house expression measurements.
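As a rough illustration of the decomposition named in the abstract, the following is a minimal sketch of a generalized singular value decomposition of two matrices that share their column dimension. It is not the authors' implementation: the function name `gsvd_full_rank`, the random toy matrices, and the simplifying assumptions (each matrix has at least as many rows as columns, and the stacked matrix has full column rank) are all introduced here for illustration only. Real expression data with fewer samples than genes would need a more general routine.

```python
import numpy as np


def gsvd_full_rank(A, B):
    """Return U, V, c, s, Xt with A = U diag(c) Xt and B = V diag(s) Xt.

    Simplified sketch (hypothetical helper): assumes A and B have the same
    number of columns n, each has at least n rows, and the stacked matrix
    [A; B] has full column rank.
    """
    m1 = A.shape[0]
    # Reduced QR of the stacked matrix; Q has orthonormal columns.
    Q, R = np.linalg.qr(np.vstack([A, B]))
    Q1, Q2 = Q[:m1], Q[m1:]
    # SVD of the top block yields the "cosines" c and a shared rotation W.
    U, c, Wt = np.linalg.svd(Q1, full_matrices=False)
    # Columns of Q2 @ W are mutually orthogonal; their norms are the "sines" s.
    Q2W = Q2 @ Wt.T
    s = np.linalg.norm(Q2W, axis=0)
    V = Q2W / s  # assumes no entry of s is zero
    # Shared right-hand factor common to both matrices.
    Xt = Wt @ R
    return U, V, c, s, Xt


# Toy data only: two random matrices with a common column dimension.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
B = rng.normal(size=(5, 3))
U, V, c, s, Xt = gsvd_full_rank(A, B)

# Generalized singular values: relative weight of each shared pattern in A versus B.
ratios = c / s
```

In this sketch both matrices are expressed in terms of the same right-hand factor `Xt`, and each pair `(c[i], s[i])` satisfies `c[i]**2 + s[i]**2 == 1` up to rounding. A large ratio `c[i] / s[i]` suggests that the corresponding pattern is carried mainly by `A`, a small one that it is carried mainly by `B`.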



Figure 1: The microarray and Q-PCR datasets. The potential overlap of the microarray and Q-PCR datasets is depicted here. 'Region A' generically refers to the overlap between the datasets, 'Region B' to the part unique to the microarray and 'Region C' to that part unique to the Q-PCR series. It is clear from the context whether references to these regions in the main text refer to genes (top panel) or tissues probed (bottom panel). While up to 59 genes are simultaneously probed by both platforms, one would not necessarily expect their expression profiles to be identical in every case: differential contributions from (unknown) paralogs and/or closely related gene family members, as well as alternative splicing, can lead to distortions because the probes on the microarray and the primers used for the Q-PCR generally target different regions of a gene. Similarly, some tissues are probed simultaneously using both platforms, while others can be found in only one of the datasets. For a few tissues the overlap is hard to determine due to possible differences in developmental stage (shown in brackets; dgs = day old germinating seedling, dap = days after pollination, dba = days before anthesis, s = 10 cm seedling, ba = before anthesis). Further details about the tissues probed with the microarray may be found in Ref. [22], while details about those included in the Q-PCR series can be found in Ref. [48].

Mentions: The Q-PCR based dataset, on the other hand, was taken from a series of 11 barley tissues, from the cultivar 'Sloop'. This dataset contains expression data for almost 80 genes that are mostly related to the synthesis or modification of cell wall polysaccharides. A number of these genes are only transcribed at relatively low levels and so it is not surprising that, while some of them are represented on the Barley1 chip, quite a few are unique to the Q-PCR dataset. The Q-PCR technique is more suited for detailed, targeted studies of genes of particular interest and, in contrast to the microarray dataset, it can be considered to be an 'open' dataset: it can easily be enlarged through the design of additional primers. For details on this Q-PCR data we refer to the Methods section as well as to the additional material [see Additional file 1]. The relationship between the genes and tissues probed in the two datasets is summarized in Figure 1.

