Empirical study of supervised gene screening.

Ma S - BMC Bioinformatics (2006)

Bottom Line: Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods. From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored.


Affiliation: Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA. shuangge.ma@yale.edu

ABSTRACT

Background: Microarray studies provide a way of linking variations in phenotypes with their genetic causes. Constructing predictive models from high-dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as the t-statistic or the signal-to-noise ratio, have been used to rank genes in the supervised screening. Despite its extensive use, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.

Results: We investigate the concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlap between the top-ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures the reproducibility of individual genes under supervised screening. Empirical studies are based on four public microarray datasets. We consider the cases where the top 20%, 40% and 60% of genes are screened.
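As an illustration of the concordance measure described above, the following minimal sketch ranks genes by two marginal statistics and reports the fraction of top-ranked genes they share. It is not the paper's implementation: the function names, the simulated data, and the exact forms of the two statistics (a Welch-type t-statistic and a signal-to-noise ratio, both named in the Background) are assumptions made for the example.

```python
import numpy as np

def t_statistic(X, y):
    """Welch-type two-sample t-statistic for each gene (column of X)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def signal_to_noise(X, y):
    """Difference in class means divided by the sum of class standard deviations."""
    a, b = X[y == 0], X[y == 1]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1))

def top_genes(scores, fraction):
    """Indices of the top `fraction` of genes, ranked by absolute score."""
    k = max(1, int(round(fraction * scores.size)))
    order = np.argsort(-np.abs(scores))
    return set(order[:k].tolist())

def overlap_fraction(set_a, set_b):
    """Share of genes in set_a that also appear in set_b (both sets have equal size here)."""
    return len(set_a & set_b) / len(set_a)

# Simulated data (an assumption): 40 samples, 200 genes, 10 truly differential genes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 2.0

for fraction in (0.2, 0.4, 0.6):
    by_t = top_genes(t_statistic(X, y), fraction)
    by_snr = top_genes(signal_to_noise(X, y), fraction)
    print(f"top {int(fraction * 100)}%: overlap = {overlap_fraction(by_t, by_snr):.2f}")
```

An overlap of 1.0 means the two statistics select exactly the same genes at that cutoff; lower values indicate that the choice of statistic changes which genes reach the model-building step.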

Conclusion: From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) the genes that pass different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and the percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes that pass supervised screening are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.


Figure 1: Empirical study: validity of supervised gene screening. The percentage of bootstrap samples (out of 1000) in which each individual gene is included among the top-ranked 20% of genes.

Mentions: The percentage computed here is closely related to the Bootstrap Reproducibility Index proposed below. We choose statistics 2, 5 and 8 (see the Methods section) as examples and show the percentage plot in Figure 1. Note that in Figure 1 we sort the genes based on decreasing percentages. We can see from Figure 1 that the percentages are far from being flat: there are some genes with very high percentages of being selected, which indicates that the supervised screening is reasonably reproducible. Studies with other screening statistics and other datasets show similar results and are omitted here.
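The percentage plotted in Figure 1 can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the marginal statistic (an absolute Welch-type t-statistic), the within-class resampling scheme, the simulated data, and all function names are hypothetical choices. The exact definition of the Bootstrap Reproducibility Index is not given in this excerpt and is not reproduced here.

```python
import numpy as np

def abs_t_statistic(X, y):
    """Absolute Welch-type t-statistic per gene; undefined scores (0/0) are set to 0."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.nan_to_num(np.abs(t), nan=0.0)

def bootstrap_selection_percentage(X, y, fraction=0.2, n_boot=1000, seed=0):
    """Percentage of bootstrap samples in which each gene is among the top `fraction`."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    k = max(1, int(round(fraction * n_genes)))
    counts = np.zeros(n_genes)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    for _ in range(n_boot):
        # Assumption: resample with replacement within each class,
        # so every bootstrap sample keeps the original class sizes.
        rows = np.concatenate([
            rng.choice(idx0, size=idx0.size, replace=True),
            rng.choice(idx1, size=idx1.size, replace=True),
        ])
        scores = abs_t_statistic(X[rows], y[rows])
        counts[np.argsort(-scores)[:k]] += 1
    return 100.0 * counts / n_boot

# Simulated data (an assumption): 40 samples, 200 genes, 10 truly differential genes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 2.0

percentages = bootstrap_selection_percentage(X, y, fraction=0.2, n_boot=1000)
# Sort in decreasing order, as done for the curve in Figure 1.
sorted_percentages = np.sort(percentages)[::-1]
print(sorted_percentages[:15])
```

If screening were no better than chance, every gene would be selected in roughly 20% of bootstrap samples and the sorted curve would be flat; a curve that starts near 100% and then drops indicates that some genes are selected consistently.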

