Performance of random forest when SNPs are in linkage disequilibrium.

Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL - BMC Bioinformatics (2009)

Bottom Line: For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.


Affiliation: Department of Medicine, Boston University, MA, USA. ymeng@broad.mit.edu

ABSTRACT

Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build an RF.
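The idea of building each tree only from SNPs in approximate LE can be illustrated with a minimal sketch. It is not the authors' algorithm: the squared genotype correlation as an LD proxy, the threshold `r2_max`, the random-order greedy selection, and the function names `genotype_r2` and `pick_le_subset` are all assumptions made for illustration.

```python
import random
from statistics import mean


def genotype_r2(a, b):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2).

    Used here as a simple proxy for LD between unphased SNPs (an assumption).
    """
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as no LD
    return cov * cov / (va * vb)


def pick_le_subset(genotypes, r2_max=0.1, rng=None):
    """Greedily pick SNPs whose pairwise r^2 with all chosen SNPs is <= r2_max.

    genotypes: dict mapping SNP name -> list of dosages (one per individual).
    SNPs are visited in random order, so repeated calls can return a
    different subset, e.g. one subset per tree (hypothetical scheme).
    """
    rng = rng or random.Random()
    order = list(genotypes)
    rng.shuffle(order)
    chosen = []
    for snp in order:
        if all(genotype_r2(genotypes[snp], genotypes[k]) <= r2_max for k in chosen):
            chosen.append(snp)
    return chosen
```

Calling `pick_le_subset` once per tree would give each tree a different set of mutually uncorrelated SNPs, so a risk SNP does not have to compete with its LD proxies within that tree.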

Results: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.

Conclusion: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.


Figure 4: Mean of IM(rSNP), IM(rHAP) and IM(PRED.rHAP). The data were simulated using model H4M4 with K4S4N100 analysis design. The importance measures are the original importance measures (IM0) combined with the original RF.

Mentions: Under the H4M4 model described in Table 1 with the K4S4N100 design, the IM for risk SNPs increased with increasing LD between the risk SNPs. The mean IM for risk haplotypes and predicted risk haplotypes was relatively stable as the LD between the risk SNPs increased. The results displayed in Figure 4 show that, in general, the mean IM of the risk haplotype was higher than that of the predicted risk haplotype, which in turn was higher than the mean IM of the risk SNPs that make up the haplotype when these SNPs were used as independent predictors. The difference in IM among the three analysis options decreased as the strength of the LD between the risk SNPs in the risk haplotype increased (Figure 4). The proportion of replicates for which the IMs of all of the risk SNPs (for the SNP analysis) or risk haplotypes (for the haplotype methods) exceeded the maximum IM of the noise SNPs also increased as LD between the risk SNPs in the risk haplotype increased (Figure 5). The proportions of replicates for which all risk variables (risk SNPs or risk haplotype, depending on analysis method) were among the top-ranking X variables showed a similar trend (Figure 6). Since the two risk SNPs in the risk haplotype are correlated, identification of one should bring attention to the region. Therefore, we also examined the performance when the best risk SNP in the risk haplotype is considered. Importantly, the proportion of replicates for which at least one of the risk SNPs in each risk haplotype was among the top-ranking X variables was greater than the proportion of replicates where both risk SNPs in the risk haplotype were among the top variables, and was also better than or close to the proportion for the risk haplotype or predicted haplotype as predictors. Due to the computational burden of calculating predicted haplotypes, the analyses using predicted haplotypes were performed only for the H4M4 model and K4S4N100 design, with 1000 trees for each random forest.
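The three replicate-level summaries described above (all risk variables exceeding the maximum noise IM, all risk variables among the top-ranking X, and at least one risk variable among the top-ranking X) can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes each replicate is a dictionary of importance measures keyed by variable name and a single risk haplotype, and the function names are hypothetical.

```python
def all_risk_beat_noise(im, risk):
    """True if every risk variable's IM exceeds the largest IM among noise variables."""
    noise_max = max(v for k, v in im.items() if k not in risk)
    return all(im[r] > noise_max for r in risk)


def top_x(im, x):
    """Names of the x variables with the highest IM in one replicate."""
    ranked = sorted(im, key=im.get, reverse=True)
    return set(ranked[:x])


def summarize(replicates, risk, x):
    """Proportion of replicates satisfying each criterion.

    replicates: list of dicts mapping variable name -> IM (one dict per replicate).
    risk: collection of risk variable names (assumed to form one risk haplotype).
    x: size of the top-ranking list.
    """
    risk = set(risk)
    n = len(replicates)
    beat = sum(all_risk_beat_noise(im, risk) for im in replicates)
    all_top = sum(risk <= top_x(im, x) for im in replicates)
    any_top = sum(bool(risk & top_x(im, x)) for im in replicates)
    return {
        "all_exceed_noise": beat / n,
        "all_in_top_x": all_top / n,
        "any_in_top_x": any_top / n,
    }
```

Because "at least one in the top X" is a weaker condition than "all in the top X", `any_in_top_x` can never be smaller than `all_in_top_x`, which is consistent with the ordering reported in the text.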

