The choice of distributions for detecting gene-gene interactions in genome-wide association studies.

Yang C, Wan X, He Z, Yang Q, Xue H, Yu W - BMC Bioinformatics (2011)

Bottom Line: In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of distributions. The permutation test, or testing on an independent data set, can help choose appropriate distributions in hypothesis testing, which provides more reliable results in practice.


Affiliation: Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. eeyang@ust.hk

ABSTRACT

Background: In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g. Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.

Results: In this paper, we show that this stage-wise strategy can fail to control the false positive rate if the distribution used in hypothesis testing is not appropriately chosen. This is because screening and modeling may change that distribution. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of distributions. To choose appropriate distributions, we suggest using the permutation test or testing on an independent data set. We demonstrate their performance using synthetic data and a real genome-wide data set from an Age-related Macular Degeneration (AMD) study.

Conclusions: The permutation test, or testing on an independent data set, can help choose appropriate distributions in hypothesis testing, which provides more reliable results in practice.
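The permutation idea named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the p-value convention (exceed + 1) / (n_perm + 1), and the toy statistic `max_mean_difference` (a stand-in for "screen, then model", not MDR) are all assumptions made here. The key point it shows is that the whole selection pipeline is rerun on every label shuffle, so the reference distribution already reflects the search.

```python
import random


def permutation_p_value(genotypes, labels, pipeline_statistic, n_perm=200, seed=0):
    """Permutation p-value for a statistic produced by a full selection pipeline.

    genotypes: list of rows, one per sample, each a list of genotype codes.
    labels: list of 0/1 phenotype labels (1 = case, 0 = control).
    pipeline_statistic: hypothetical callback that runs every selection step
        and returns the final test statistic (larger = more evidence).
    """
    rng = random.Random(seed)
    observed = pipeline_statistic(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling labels breaks any genotype-phenotype association
        # while keeping the genotype structure intact.
        rng.shuffle(shuffled)
        if pipeline_statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one convention keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


def max_mean_difference(genotypes, labels):
    """Toy stand-in for a search: best case/control mean difference over SNPs."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    best = 0.0
    for j in range(len(genotypes[0])):
        case_sum = sum(row[j] for row, y in zip(genotypes, labels) if y == 1)
        ctrl_sum = sum(row[j] for row, y in zip(genotypes, labels) if y == 0)
        best = max(best, abs(case_sum / n_case - ctrl_sum / n_ctrl))
    return best
```

Because the maximum over SNPs is recomputed for each shuffle, the resulting p-value accounts for the selection step, which a fixed textbook reference distribution would not.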

Figure 2: Null distributions affected by MDR modeling. The null distributions are estimated using 500 simulated data sets. Each data set contains n = 2000 samples. Upper panel: From left to right, each data set has L = 2, L = 3, L = 4 SNPs. MDR can be applied to these data sets without model search to fit the two-factor model (d = 2), the three-factor model (d = 3), and the four-factor model (d = 4). The resulting distributions follow χ2 distributions with df = 4.84, 11.40, 30.41, respectively. Lower panel: Each data set contains n = 2000 samples and L = 20 SNPs. MDR is applied directly to each data set: it searches all possible models and uses cross-validation to assess each one. The best two-factor model (d* = 2), the best three-factor model (d* = 3), and the best four-factor model (d* = 4) are identified. Their distributions, shown from left to right, do not strictly follow χ2 distributions.

Mentions: In the first scenario, we generate data sets containing two, three and four SNPs for the settings d = 2, 3, 4, respectively. All SNPs are generated under the Hardy-Weinberg principle with minor allele frequencies uniformly distributed in [0.05, 0.5]. Because each data set contains exactly d SNPs, the MDR model can be fitted directly without search. Take d = 2 as an example. For each data set, we first obtain a genotype contingency table as shown in Table 1, and then collapse it into a 2 × 2 contingency table. Next we conduct the statistical test (either the Pearson χ2 test or the likelihood ratio test can be used, since their difference is negligible). The histogram of the resulting statistics forms the null distribution. The cases d = 3, 4 are handled in the same way. The histograms of these distributions obtained from 500 data sets are shown in the upper panel of Figure 2. We observe that the null distributions of MDR (without search) follow χ2 distributions. The estimated degrees of freedom are df = 4.84, df = 11.40 and df = 30.41 for d = 2, d = 3 and d = 4, respectively. Non-integer degrees of freedom are well defined; see [31-33]. This clearly indicates that the χ2 distribution with df = 1, the nominal reference for a 2 × 2 table, is not an appropriate null distribution for MDR.
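The first scenario can be sketched in code as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: the balanced case/control split, the strict-inequality rule for labelling a genotype cell high-risk, and the use of the sample mean as a moment estimate of df (the mean of a χ2 distribution equals its degrees of freedom) are choices made here, and all function names are hypothetical.

```python
import random


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty margin carries no evidence of association
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def mdr_null_statistic(n, d, rng):
    """Simulate one null data set with d SNPs and return the MDR statistic."""
    n_case = n // 2  # assumption: balanced design
    n_ctrl = n - n_case
    # One minor allele frequency per SNP, uniform on [0.05, 0.5].
    mafs = [rng.uniform(0.05, 0.5) for _ in range(d)]
    cells = {}  # genotype combination -> [case count, control count]
    for i in range(n):
        # Hardy-Weinberg: a genotype is the number of minor alleles among
        # two independent allele draws. Labels are independent of genotypes,
        # so the data are generated under the null.
        combo = tuple((rng.random() < m) + (rng.random() < m) for m in mafs)
        counts = cells.setdefault(combo, [0, 0])
        counts[0 if i < n_case else 1] += 1
    # MDR collapse: a cell is high-risk when its case/control ratio exceeds
    # the overall ratio (ties go to low-risk, an assumption of this sketch).
    hi_case = hi_ctrl = lo_case = lo_ctrl = 0
    for case_count, ctrl_count in cells.values():
        if case_count * n_ctrl > ctrl_count * n_case:
            hi_case += case_count
            hi_ctrl += ctrl_count
        else:
            lo_case += case_count
            lo_ctrl += ctrl_count
    return pearson_chi2_2x2(hi_case, hi_ctrl, lo_case, lo_ctrl)


def estimate_df(n_sets=500, n=2000, d=2, seed=0):
    """Moment estimate of df: the mean of the simulated null statistics."""
    rng = random.Random(seed)
    stats = [mdr_null_statistic(n, d, rng) for _ in range(n_sets)]
    return sum(stats) / len(stats)
```

Calling `estimate_df(d=2)` mirrors the setting of 500 data sets with n = 2000 samples. Because the collapse step assigns each genotype cell to whichever risk group favors association, the statistic is inflated relative to a fixed 2 × 2 table; the estimate from this sketch need not match the reported df = 4.84 exactly, since the estimation method used in the paper is not given here.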

