Further improvements to linear mixed models for genome-wide association studies.

Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, Listgarten J, Heckerman D - Sci Rep (2014)

Bottom Line: Traditionally, all available SNPs are used to estimate the GSM. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM.

Affiliation: eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States.

ABSTRACT
We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.
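
The mixture of two component GSMs described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper's software: the function names `standardize`, `gsm`, and `mixed_gsm`, the choice of a realized relationship matrix (standardized genotypes times their transpose, divided by the number of SNPs) as the similarity measure, and the fixed mixing `weight`, whose selection this excerpt does not describe.

```python
import numpy as np

def standardize(snps):
    """Center each SNP column and scale it to unit variance."""
    snps = np.asarray(snps, dtype=float)
    mean = snps.mean(axis=0)
    std = snps.std(axis=0)
    std[std == 0] = 1.0  # monomorphic SNPs become all-zero columns, not NaN
    return (snps - mean) / std

def gsm(snps):
    """Genetic similarity matrix (assumed form): Z Z^T / number of SNPs."""
    z = standardize(snps)
    return z @ z.T / z.shape[1]

def mixed_gsm(snps, selected, weight):
    """Convex mixture of the all-SNP GSM and the selected-SNP GSM.

    `selected` holds column indices of SNPs assumed to predict the
    phenotype well; `weight` in [0, 1] is a hypothetical mixing parameter.
    """
    snps = np.asarray(snps, dtype=float)
    k_all = gsm(snps)
    k_sel = gsm(snps[:, selected])
    return (1.0 - weight) * k_all + weight * k_sel
```

With `weight = 0` the sketch reduces to the traditional all-SNP GSM, and with `weight = 1` it reduces to a GSM built only from the selected SNPs.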

Figure 1: Empirical type I error rate and power for no population or family relatedness with purely synthetic data. Type I error rate is plotted as a function of P value cutoff α. Each point represents the average type I error rate or power across 18 data sets with different degrees of signal (narrow-sense heritability). Shading on the curves for type I error rate represents the 95% confidence interval assuming type I error is controlled. All power curves that are visibly separated have significant differences between them. For example, comparing power for Linreg and LMM(all) for 10 causal SNPs at a type I error of 10⁻³, the P value from a two-sided binomial test applied to the number of true positives is 0.03.

Mentions: For each of the three models, we measured empirical type I error rate (the proportion of non-causal SNPs deemed significant) as a function of P-value threshold. In addition, we measured empirical power (the proportion of causal SNPs deemed significant) as a function of empirical type I error (Figure 1; note that a fourth method shown in the figure, LMM(all + select), will be introduced in the following section). Results are shown for different numbers of causal SNPs. All models controlled type I error well. Furthermore, LMM(select) yielded the most power, especially when the number of causal SNPs was small (and thus the effect sizes were large). That LMM(select) had more power than Linreg is not surprising when thinking about the LMM as linear regression with selected SNPs as covariates. Namely, conditioning on selected SNPs reduces noise in the phenotype. That LMM(select) had more power than LMM(all) when there were few causal SNPs is also expected, as presumably the use of all SNPs in the GSM obfuscated the true causal signal, a phenomenon called “dilution” [7].
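
The two quantities defined above can be computed from per-SNP P values and a causal/non-causal labeling, as in the following minimal sketch. The function name `type1_error_and_power`, its arguments, and the use of `p <= alpha` (rather than a strict inequality) as the significance rule are assumptions made for illustration; they are not drawn from the authors' code.

```python
import numpy as np

def type1_error_and_power(p_values, is_causal, alpha):
    """Empirical type I error rate and power at P-value threshold `alpha`.

    Type I error: fraction of non-causal SNPs deemed significant.
    Power: fraction of causal SNPs deemed significant.
    """
    p = np.asarray(p_values, dtype=float)
    causal = np.asarray(is_causal, dtype=bool)
    significant = p <= alpha
    type1 = significant[~causal].mean()
    power = significant[causal].mean()
    return type1, power
```

Evaluating this function over a sweep of thresholds and pairing each resulting type I error rate with its power would give a power-versus-type-I-error curve of the kind the text describes.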

