Further improvements to linear mixed models for genome-wide association studies.

Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, Listgarten J, Heckerman D - Sci Rep (2014)

Bottom Line: Traditionally, all available SNPs are used to estimate the GSM. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype, again controls type I error and yields more power than the traditional LMM.


Affiliation: eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States.

ABSTRACT
We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.
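As a rough illustration of the two-component GSM described in the abstract, the sketch below builds a similarity matrix from standardized SNPs and mixes the all-SNP matrix with one built from a subset. It is a minimal sketch under stated assumptions, not the authors' implementation: the normalization (K = Z Zᵀ / m), the function names, the `selected` index list, and the fixed `weight` are all hypothetical choices for illustration. How the predictive SNPs and the mixing weight are actually chosen is not specified in this excerpt.

```python
import numpy as np


def standardize_snps(snps):
    """Center each SNP column and scale it to unit variance.

    Monomorphic SNPs (zero variance) carry no similarity information
    and are dropped.
    """
    snps = np.asarray(snps, dtype=float)
    sd = snps.std(axis=0)
    keep = sd > 0
    return (snps[:, keep] - snps[:, keep].mean(axis=0)) / sd[keep]


def gsm(snps):
    """Assumed GSM form: K = Z Z^T / m for the standardized SNP matrix Z."""
    z = standardize_snps(snps)
    return z @ z.T / z.shape[1]


def mixture_gsm(snps, selected, weight):
    """Convex combination of the all-SNP GSM and a selected-SNP GSM.

    `selected` (column indices) and `weight` (in [0, 1]) are hypothetical
    inputs; in practice they would come from a selection procedure.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    snps = np.asarray(snps)
    k_all = gsm(snps)
    k_sel = gsm(snps[:, selected])
    return (1.0 - weight) * k_all + weight * k_sel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genotypes = rng.binomial(2, 0.3, size=(20, 50))  # toy 0/1/2 allele counts
    k = mixture_gsm(genotypes, selected=[0, 1, 2, 3, 4], weight=0.5)
    print(k.shape)
```

Because both components are positive semidefinite, any convex combination of them is as well, so the mixture remains a valid covariance matrix for the random effect in an LMM.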

Figure 3: Graphical models for the data-generation process. The variable l is hidden (latent) and corresponds to confounding structure, either population structure or family relatedness. The variables c and s with subscripts correspond to causal and non-causal SNPs, respectively.


Mentions: One way to understand these results is through consideration of the graphical-model structure (ref. 25) for our data-generation process (Figure 3a). Here, l is a hidden (latent) variable denoting which of the two Balding-Nichols populations the individual is in, y is the phenotype, and c and s with subscripts denote the causal and non-causal SNPs, respectively. The graphical-model structure conveys assertions of conditional independence among the variables represented. For example, when l is unobserved, there are open paths between the SNP variables, reflecting induced correlation among the SNPs. In contrast, if l were observed, these paths would be closed, reflecting the lack of correlation among SNPs within each population. Ref. 25 describes generally how to determine correlations from the graph based on what is and is not observed.
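The open-versus-closed path argument can be checked numerically with a small simulation. The sketch below is an illustration under assumptions, not the paper's data-generation code: it draws two populations with Balding-Nichols-style allele frequencies, samples SNPs independently given the population label l, and compares the average absolute SNP-SNP correlation with l ignored against the same quantity within each population. The sample sizes, the Fst value of 0.1, the ancestral frequency range, and the function names are all hypothetical.

```python
import numpy as np


def simulate_two_populations(n_per_pop=2000, n_snps=100, fst=0.1, seed=0):
    """Draw genotypes for two populations under a Balding-Nichols-style model.

    Returns (l, snps): l is the population label per individual (the latent
    variable) and snps is an individuals-by-SNPs matrix of 0/1/2 counts.
    """
    rng = np.random.default_rng(seed)
    # Ancestral allele frequencies, kept away from 0 and 1.
    p_anc = rng.uniform(0.2, 0.8, size=n_snps)
    # Each population's frequency is Beta-distributed around the ancestral
    # frequency, with spread governed by Fst.
    a = p_anc * (1.0 - fst) / fst
    b = (1.0 - p_anc) * (1.0 - fst) / fst
    p_pop = rng.beta(a, b, size=(2, n_snps))
    l = np.repeat([0, 1], n_per_pop)
    # Given l, the SNPs are sampled independently of one another.
    snps = rng.binomial(2, p_pop[l])
    return l, snps


def mean_abs_offdiag_corr(x):
    """Mean absolute correlation over all distinct pairs of columns."""
    c = np.corrcoef(x, rowvar=False)
    off_diagonal = c[~np.eye(c.shape[0], dtype=bool)]
    return float(np.nanmean(np.abs(off_diagonal)))


if __name__ == "__main__":
    l, snps = simulate_two_populations()
    print("l unobserved (pooled):", mean_abs_offdiag_corr(snps))
    print("l observed, population 0:", mean_abs_offdiag_corr(snps[l == 0]))
    print("l observed, population 1:", mean_abs_offdiag_corr(snps[l == 1]))
```

In the pooled data the SNPs are correlated only through their shared dependence on l, so conditioning on l (here, subsetting to one population) should leave nothing beyond sampling noise.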

