The choice of distributions for detecting gene-gene interactions in genome-wide association studies.
Bottom Line:
Stage-wise strategies can fail to control the false positive rate because screening and modeling may change the distribution used in hypothesis testing. To choose appropriate distributions, we suggest using the permutation test or testing on an independent data set. Either approach helps in choosing appropriate distributions for hypothesis testing, which provides more reliable results in practice.
Affiliation: Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. eeyang@ust.hk
ABSTRACT
Background: In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome this computational difficulty. In the first stage, fast screening methods (e.g., Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.
Results: In this paper, we show that this stage-wise strategy can fail to control the false positive rate if the distribution is not appropriately chosen. This is because screening and modeling may change the distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of distributions. To choose appropriate distributions, we suggest using the permutation test or testing on an independent data set. We demonstrate their performance using synthetic data and a real genome-wide data set from an Age-related Macular Degeneration (AMD) study.
Conclusions: The permutation test or testing on an independent data set can help in choosing appropriate distributions for hypothesis testing, which provides more reliable results in practice.
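The permutation test recommended in the abstract can be illustrated with a minimal Python sketch. The key point is that the whole pipeline (screening, model search and scoring) is re-run on every permuted copy of the phenotype labels, so the null distribution already reflects the selection steps. Everything named here is an assumption made for illustration: `toy_pipeline` is a hypothetical stand-in that exhaustively searches SNP pairs and scores each with a plain Pearson chi-square on the genotype-pair table, not Tuning ReliefF or MDR, and `permutation_p_value`, `n_perm` and `seed` are illustrative names and defaults.

```python
import random
from itertools import combinations


def pair_chi2(x1, x2, labels):
    """Pearson chi-square for the (genotype pair) x (control/case) table."""
    n = len(labels)
    n_case = sum(labels)
    col_tot = (n - n_case, n_case)  # totals for label 0 (control), 1 (case)
    cells = {}
    for g1, g2, y in zip(x1, x2, labels):
        cells.setdefault((g1, g2), [0, 0])[y] += 1
    stat = 0.0
    for counts in cells.values():
        row_tot = counts[0] + counts[1]
        for y in (0, 1):
            expected = row_tot * col_tot[y] / n
            if expected > 0:
                stat += (counts[y] - expected) ** 2 / expected
    return stat


def toy_pipeline(snps, labels):
    """Hypothetical stand-in for screening + modeling: best score over all pairs."""
    return max(pair_chi2(a, b, labels) for a, b in combinations(snps, 2))


def permutation_p_value(pipeline, snps, labels, n_perm=200, seed=0):
    """Empirical p-value; the full pipeline is re-run on each shuffled label set."""
    rng = random.Random(seed)
    observed = pipeline(snps, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        if pipeline(snps, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed data as one permutation, so p > 0.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy genotypes (uniform over 0/1/2, not Hardy-Weinberg) and null labels.
    snps = [[rng.randint(0, 2) for _ in range(200)] for _ in range(4)]
    labels = [rng.randint(0, 1) for _ in range(200)]
    print(permutation_p_value(toy_pipeline, snps, labels, n_perm=99))
```

Comparing the observed best score to a fixed chi-square reference would ignore the search over pairs, whereas the permuted scores above are maxima over the same search, which is the reason the permutation null is appropriate in this setting.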
Mentions: In the first scenario, we generate data sets containing two, three and four SNPs for the settings d = 2, 3 and 4, respectively. All SNPs are generated under the Hardy-Weinberg principle, with minor allele frequencies uniformly distributed in [0.05, 0.5]. Because each data set contains exactly d SNPs, the MDR model can be fitted directly, without search. Take d = 2 as an example. For each data set, we first obtain a genotype contingency table as shown in Table 1, and then collapse it into a 2 × 2 contingency table. Next, we conduct the statistical test (either the Pearson χ2 test or the likelihood ratio test can be used, since their difference is negligible). The histogram of the resulting test statistics over the data sets forms the distribution. The cases d = 3 and d = 4 are handled in the same way. The histograms of these distributions obtained from 500 data sets are shown in the upper panel of Figure 2. We observe that the distributions of MDR (without search) follow χ2 distributions. The estimated degrees of freedom of these χ2 distributions are df = 4.84, df = 11.40 and df = 30.41 for d = 2, d = 3 and d = 4, respectively. A non-integer degree of freedom is well defined; see [31-33]. This clearly indicates that the χ2 distribution with df = 1, the usual reference distribution for a 2 × 2 contingency table, is not an appropriate distribution for MDR.
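The simulation described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' code: the sample size (400, balanced cases and controls), the high-risk rule (a genotype cell is high-risk when its case:control ratio exceeds the overall ratio) and the method-of-moments estimate of df (the mean of a χ2 variable equals its degrees of freedom) are all choices made here, since the passage does not specify them. The function names are hypothetical.

```python
import random


def simulate_genotype(maf, rng):
    """Hardy-Weinberg genotype: number of minor alleles in two independent draws."""
    return (rng.random() < maf) + (rng.random() < maf)


def mdr_chi2(genotypes, labels):
    """Collapse the genotype table into high/low risk and return Pearson chi-square.

    genotypes: one tuple of d genotype codes per sample; labels: 0 = control, 1 = case.
    """
    n = len(labels)
    n_case = sum(labels)
    n_ctrl = n - n_case
    cells = {}  # genotype combination -> [controls, cases]
    for g, y in zip(genotypes, labels):
        cells.setdefault(g, [0, 0])[y] += 1
    a = b = 0  # cases and controls falling in high-risk cells
    for ctrl, case in cells.values():
        # Assumed rule: high-risk if case:control ratio exceeds the overall ratio.
        if case * n_ctrl > ctrl * n_case:
            a += case
            b += ctrl
    c, d = n_case - a, n_ctrl - b  # cases and controls in low-risk cells
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def estimate_null_df(d, n_datasets=500, n_samples=400, seed=0):
    """Mean MDR statistic over null data sets (moment estimate of chi-square df)."""
    rng = random.Random(seed)
    # Labels are fixed and independent of the genotypes, so every data set is null.
    labels = [1] * (n_samples // 2) + [0] * (n_samples - n_samples // 2)
    stats = []
    for _ in range(n_datasets):
        mafs = [rng.uniform(0.05, 0.5) for _ in range(d)]
        genotypes = [tuple(simulate_genotype(m, rng) for m in mafs)
                     for _ in range(n_samples)]
        stats.append(mdr_chi2(genotypes, labels))
    return sum(stats) / len(stats)


if __name__ == "__main__":
    for d in (2, 3, 4):
        print(d, round(estimate_null_df(d), 2))
```

Because the high-risk and low-risk groups are defined using the same labels that are then tested, the collapsed 2 × 2 statistic is inflated even without any search, so the estimates should come out well above 1. The exact values depend on the assumed sample size and estimator and need not match those reported above.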