A comparison of gene region simulation methods.

Hendricks AE, Dupuis J, Gupta M, Logue MW, Lunetta KL - PLoS ONE (2012)

Bottom Line: When possible, we also evaluate the effects of changing parameters. We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.

Affiliation: Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America. baera@bu.edu

ABSTRACT

Background: Accurately modeling linkage disequilibrium (LD) in simulations is essential to correctly evaluate new and existing association methods. To date, little research has compared how well existing gene region simulation methods reproduce the LD structure of an existing gene region. Here we compare the ability of three approaches to accurately simulate the LD within a gene region: HapSim (2005), Hapgen (2009), and a minor extension to simple haplotype resampling.

Methodology/principal findings: To observe the variation and bias of each method, we compare the simulated pairwise LD measures and minor allele frequencies to the original HapMap data in an extensive simulation study. When possible, we also evaluate the effects of changing parameters. HapSim produces samples of haplotypes with lower LD, on average, than the original haplotype set, whereas neither our resampling method nor Hapgen introduces this bias. The variation introduced across the replicates by our resampling method is quite small and may not provide enough sampling variability for a generalizable simulation study.

Conclusion: We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.
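
The comparison described in the abstract rests on two summaries of a haplotype set: minor allele frequencies and pairwise LD. The sketch below shows one way such summaries might be computed. It assumes haplotypes are stored as a 0/1 NumPy array (rows are haplotypes, columns are SNPs) and uses r² as the LD measure; the abstract does not say which measure was used, so that choice, the function names, and the toy data are all hypothetical. The `bootstrap_haplotypes` helper is plain resampling with replacement, not the authors' extension, whose details are not given here.

```python
import numpy as np

def minor_allele_frequencies(haps):
    """Minor allele frequency per SNP for a 0/1 haplotype matrix."""
    freq = haps.mean(axis=0)            # frequency of the allele coded 1
    return np.minimum(freq, 1.0 - freq)

def pairwise_r2(haps):
    """Pairwise r^2 between SNPs.

    For biallelic haplotype data, r^2 equals the squared Pearson
    correlation of the 0/1 allele indicators. Monomorphic SNPs have
    zero variance and yield NaN entries.
    """
    r = np.corrcoef(haps, rowvar=False)
    return r ** 2

def bootstrap_haplotypes(haps, rng):
    """Plain resampling of whole haplotypes with replacement (assumed baseline)."""
    idx = rng.integers(0, haps.shape[0], size=haps.shape[0])
    return haps[idx]

if __name__ == "__main__":
    # Toy data for illustration only: 4 haplotypes, 3 SNPs.
    haps = np.array([[0, 0, 1],
                     [0, 0, 1],
                     [1, 1, 0],
                     [1, 1, 1]])
    print(minor_allele_frequencies(haps))
    print(pairwise_r2(haps))
    replicate = bootstrap_haplotypes(haps, np.random.default_rng(1))
    print(minor_allele_frequencies(replicate))
```

Comparing these summaries between an original haplotype set and many simulated replicates gives a rough picture of the bias and variation a simulation method introduces.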

pone-0040925-g006: Effect of dichotomizing on the correlation between two normally distributed variables. Correlation between dichotomized variables compared to the original correlation between two normally distributed variables. Each curve represents an original correlation value (ρ = 0.1 for the bottom curve to ρ = 0.9 for the top curve by 0.1). The same cut point was used for both variables.

Mentions: Approximating a positive definite matrix when the calculated covariance matrix is not positive definite will introduce error, but there is no indication that this error produces a consistent bias. Rather, the error will likely increase the variance. A more likely explanation for HapSim’s loss in LD is the dichotomizing of vectors of normally distributed variables into vectors of binary values to create the haplotypes. It has previously been shown that dichotomizing normally distributed variables into binary variables decreases the correlation between the variables [26]. To further support this, we calculated the correlation before and after dichotomizing two normally distributed variables. Figure 6 shows a loss in correlation when dichotomizing with the same threshold for each variable. We continued to see a loss in correlation when different thresholds were used for each variable (Figure S10). Thus, dichotomizing the normally distributed simulated haplotypes is likely the reason LD decreased across the region in HapSim’s simulated replicates compared to the original sample.
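
The loss in correlation described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it draws pairs from a bivariate standard normal distribution with correlation `rho`, dichotomizes both variables at the same cut point, and compares the correlation before and after. The function name, sample size, and seed are assumptions made for illustration.

```python
import numpy as np

def dichotomized_correlation(rho, cut=0.0, n=200_000, seed=0):
    """Return (correlation before, correlation after) dichotomizing at `cut`.

    Both variables are standard normal with correlation `rho`, and the
    same cut point is applied to each (hypothetical setup for illustration).
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    binary = (z > cut).astype(float)    # 1 if above the cut point, else 0
    before = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
    after = np.corrcoef(binary[:, 0], binary[:, 1])[0, 1]
    return before, after

if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        before, after = dichotomized_correlation(rho)
        print(f"rho={rho:.1f}  before={before:.3f}  after={after:.3f}")
```

For a cut point at the mean (`cut=0.0`), the correlation after dichotomizing should be close to (2/π)·arcsin(ρ), which is smaller in magnitude than ρ for any 0 < |ρ| < 1. Moving `cut` away from zero lets one explore other thresholds in the same way.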

