Performance of random forest when SNPs are in linkage disequilibrium.

Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL - BMC Bioinformatics (2009)

Bottom Line: For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.


Affiliation: Department of Medicine, Boston University, MA, USA. ymeng@broad.mit.edu

ABSTRACT

Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: changing the tree-building algorithm so that each tree in an RF is built only with SNPs in LE, modifying the importance measure (IM), and using haplotypes instead of SNPs to build an RF.
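One way to picture the first alternative, building each tree only from SNPs in LE, is the sketch below. It is an illustration under stated assumptions, not the authors' implementation: genotypes are assumed to be coded as 0/1/2 allele counts, LD is approximated by the squared Pearson correlation of those counts (so no phase is needed), and the function names, the greedy selection rule, and the r2 threshold are all hypothetical.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns (0/1/2 allele counts).

    Assumption: this correlation is used as a phase-free proxy for LD.
    Monomorphic SNPs have undefined correlation; they are set to r2 = 0.
    """
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(np.asarray(genotypes, dtype=float), rowvar=False)
    return np.nan_to_num(corr ** 2, nan=0.0)

def sample_le_subset(r2, threshold, rng):
    """Pick a random subset of SNPs in approximate linkage equilibrium.

    Hypothetical greedy rule: visit SNPs in random order and keep a SNP
    only if its r2 with every SNP kept so far is below the threshold.
    """
    kept = []
    for snp in rng.permutation(r2.shape[0]):
        if all(r2[snp, k] < threshold for k in kept):
            kept.append(int(snp))
    return sorted(kept)

# Example: 9 individuals, 3 SNPs; SNP 1 duplicates SNP 0, SNP 2 is unrelated.
genotypes = np.array([
    [0, 0, 0], [0, 0, 1], [0, 0, 2],
    [1, 1, 0], [1, 1, 1], [1, 1, 2],
    [2, 2, 0], [2, 2, 1], [2, 2, 2],
])
r2 = pairwise_r2(genotypes)
subset_for_one_tree = sample_le_subset(r2, threshold=0.2, rng=np.random.default_rng(0))
```

Drawing a fresh subset for each tree in this way would keep SNPs that are in LD with one another from competing inside the same tree, while still letting every SNP appear somewhere in the forest.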

Results: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.

Conclusion: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.


Figure 3: Proportion of replicates for which all rSNPs and/or LD.rSNPs are among the top-ranking X SNPs. The data were simulated using model H4M4 with the K4S4 analysis design. (A) Top panels: the proportion of replicates for which each rSNP or one of its corresponding LD.rSNPs is among the top X SNPs ("rSNPs/LD.rSNPs"). Left panel: RF0:IM0; right panel: RF0:IM1. (B) Middle panels: using RF0:IM1, we compare the proportion of replicates for which each rSNP or at least one corresponding LD.rSNP was among the top-ranking X SNPs, the proportion considering only rSNPs ("rSNPs with LD.rSNPs"), and the proportion when there are no LD.rSNPs in the dataset ("rSNPs, no LD.rSNPs"), with examples of 1 and 4 SNPs in LD with each rSNP. (C) Bottom panels: we compare the four combinations of original and revised RF and original and revised IM, with examples for 1 and 4 LD.rSNPs.

For all models and all four combinations of original and revised RF and IM, when X is small (8 to 20), the proportion of replicates for which each rSNP or at least one of its corresponding LD.rSNPs is among the top-ranking X SNPs is smaller when there are one or more LD.rSNPs than when there are none (Figure 3). However, as X increases to 20 or greater, the proportion for data sets with at least one LD.rSNP becomes larger than the proportion for data sets with no LD.rSNPs. We show this trend using the original IM and the revised IM with the original RF algorithm in Figure 3 (top panels). When we compare datasets with different numbers of LD.rSNPs, the proportion of replicates for which each rSNP or at least one corresponding LD.rSNP was among the top-ranking X SNPs was higher than the proportion considering only rSNPs in some situations. We see this in Figure 3 (middle panels) for X ≥ 8 when there is 1 LD.rSNP, and for X ≥ 20 when there are 4 LD.rSNPs. Thus, when identifying an LD.rSNP is counted as equivalent to identifying the rSNP itself, including the LD.rSNPs in the analysis is more powerful than using rSNPs alone under some conditions. In Figure 3 (bottom panels), we compare the four combinations of original and revised RF and original and revised IM for 1 and 4 LD.rSNPs. For a fixed number of LD.rSNPs, the original RF had better performance than the revised RF, and the original and revised IM had nearly identical performance within each RF method.
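The performance measure used in this comparison can be made concrete with a short sketch. The code below is an illustration under assumptions, not the authors' code: it assumes one row of importance scores per simulated replicate, and the function name `top_x_hit_rate` and the `groups` argument (one list of SNP indices per rSNP, optionally including its LD.rSNPs) are hypothetical.

```python
import numpy as np

def top_x_hit_rate(importances, groups, x):
    """Proportion of replicates in which every group has a member in the top x.

    importances: array of shape (n_replicates, n_snps), higher = more important.
    groups: list of lists of SNP column indices; each group holds one rSNP and,
            if LD.rSNPs count as hits, the SNPs in LD with it.
    x: number of top-ranking SNPs considered.
    Ties in importance are broken arbitrarily by the sort.
    """
    importances = np.asarray(importances, dtype=float)
    # Column indices of the x highest-scoring SNPs in each replicate.
    top = np.argsort(-importances, axis=1)[:, :x]
    hits = 0
    for row in top:
        top_set = set(row.tolist())
        # A replicate counts only if every group is represented in the top x.
        if all(top_set.intersection(group) for group in groups):
            hits += 1
    return hits / importances.shape[0]

# Toy data: 2 replicates, 4 SNPs. SNP 0 and SNP 2 are rSNPs; SNP 1 is in LD with SNP 0.
scores = [
    [0.9, 0.1, 0.8, 0.2],
    [0.1, 0.9, 0.2, 0.8],
]
rate_rsnps_only = top_x_hit_rate(scores, groups=[[0], [2]], x=3)
rate_with_ld = top_x_hit_rate(scores, groups=[[0, 1], [2]], x=3)
```

In the toy data, the second replicate ranks the LD.rSNP (SNP 1) above its rSNP (SNP 0), so it counts as a hit only when LD.rSNPs are accepted as proxies. This is the distinction between the "rSNPs/LD.rSNPs" and "rSNPs with LD.rSNPs" curves described above.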

