Correction of population stratification in large multi-ethnic association studies.

Serre D, Montpetit A, Paré G, Engert JC, Yusuf S, Keavney B, Hudson TJ, Anand S - PLoS ONE (2008)

Bottom Line: The vast majority of genetic risk factors for complex diseases have, taken individually, a small effect on the end phenotype. Population-based association studies therefore need very large sample sizes to detect significant differences between affected and non-affected individuals. Our results highlight the importance of carefully addressing population stratification and of carefully "cleaning" the sample prior to analyses to obtain stronger signals of association and to avoid spurious results.


Affiliation: Genome Quebec Innovation Centre, McGill University, Montreal, Quebec, Canada.

ABSTRACT

Background: The vast majority of genetic risk factors for complex diseases have, taken individually, a small effect on the end phenotype. Population-based association studies therefore need very large sample sizes to detect significant differences between affected and non-affected individuals. Including thousands of affected individuals in a study requires recruitment in numerous centers, possibly from different geographic regions. Unfortunately such a recruitment strategy is likely to complicate the study design and to generate concerns regarding population stratification.

Methodology/principal findings: We analyzed 9,751 individuals representing three main ethnic groups - Europeans, Arabs and South Asians - that had been enrolled from 154 centers involving 52 countries for a global case/control study of acute myocardial infarction. All individuals were genotyped at 103 candidate genes using 1,536 SNPs selected with a tagging strategy that captures most of the genetic diversity in different populations. We show that relying solely on self-reported ethnicity is not sufficient to exclude population stratification and we present additional methods to identify and correct for stratification.

Conclusions/significance: Our results highlight the importance of carefully addressing population stratification and of carefully "cleaning" the sample prior to analyses to obtain stronger signals of association and to avoid spurious results.



pone-0001382-g003: Distribution of the p-values of the associations between genotypes at 1,453 SNPs and ApoB level in South-Asians. The plot shows the observed distribution of the p-values (y-axis) against the expectation under a model without any association (grey crosses and x-axis). The axes are in logarithmic scales. Red crosses correspond to the association between ApoB and the genotypes at one SNP without any correction. Blue crosses stand for the same tests with recruitment centers used as additional covariates.
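The axes of such a plot can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name `qq_points` and the plotting position `rank / (n + 1)` for the expected p-value under no association are assumptions chosen for the example, and all p-values are assumed to be strictly positive.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values, smallest p first.

    Hypothetical helper: under a model without any association the p-values
    are uniform on (0, 1), so the rank-th smallest of n p-values is expected
    near rank / (n + 1) (one common plotting position, assumed here).
    """
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (n + 1)
        # Logarithmic scale on both axes, as in the figure.
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

Points lying above the diagonal (observed larger than expected on the -log10 scale) correspond to the global deviation towards more significant associations discussed in the text.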

Mentions: We tested the association between genotypes and Apolipoprotein B (ApoB) concentration separately in each population sample (see Materials and Methods for details). Figure 3 shows the distribution of the p-values obtained for each SNP in the South-Asian dataset. The figure shows a global deviation (towards more significant associations) from the pattern expected by chance if there is no association between genotypes and ApoB concentration. This deviation is not limited to a few outliers but affects the entire distribution. This could be an indication that many SNPs (i.e. several hundred) in our panel are significantly associated with ApoB level or, alternatively, that a previously undetected stratification in the dataset affects the results. We first tried to correct this global deviation by using the coefficients of ancestry estimated by STRUCTURE for each individual as covariates in the ANOVA. This did not lead to any significant difference in the distribution of the p-values (see Supplemental Figure S5). We then tested whether the geographic origin of the individuals could influence the associations. After using the recruitment centers as covariates in the analyses, the distribution of the p-values for the South-Asian individuals fitted the distribution expected under no association much better, and only five SNPs (most notably rs429358 in APOE) showed significant deviation from the expectation and strong association with ApoB concentration (Figure 3). Similar patterns were observed in the Arab and, to a lesser extent, in the European datasets (data not shown). Correcting the association tests for the recruitment centers thus led to a dramatic change in the overall distribution of the associations, with some of the SNPs showing up to two orders of magnitude decrease in statistical significance. It is important to note here that the stratification observed among centers is not due to a systematic difference in DNA preparation or storage between centers.
The INTERHEART protocol requires that, for every case recruited, at least one control (same sex, same age) is recruited from the same center. Blood samples (or buffy coats) from cases and controls are then shipped to Canada and treated identically (after randomization). However, due to stochastic failures at different stages (e.g. DNA extractions, genotyping), some centers included more cases than controls (or vice versa) at the end of the study, which contributes to the observed stratification effect (in combination with allele frequency differences among centers).
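The effect of adding recruitment centers as covariates can be illustrated on simulated data. The sketch below is not the authors' ANOVA and uses no INTERHEART data: it swaps in a simple linear test on allele counts, and every name and parameter (`simulate`, ten centers, the center-specific phenotype means and allele frequencies) is a hypothetical choice made for the example. Adjusting for center as a categorical covariate is implemented by centering genotype and phenotype within each center, which gives the same slope test as including one indicator variable per center.

```python
import random
from collections import defaultdict


def center_within_groups(values, groups):
    """Subtract from each value the mean of the group it belongs to."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def squared_t(x, y, df):
    """Squared t statistic for the slope of y on x (both already centered)."""
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    r2 = sxy * sxy / (sxx * syy)
    return df * r2 / (1.0 - r2)


def simulate(seed=1, n_centers=10, per_center=100, n_snps=200):
    """Return the mean test statistic without and with center adjustment."""
    rng = random.Random(seed)
    centers = [c for c in range(n_centers) for _ in range(per_center)]
    n = len(centers)
    one_group = [0] * n  # a single group: plain centering on the overall mean
    # Phenotype depends on the center only; no SNP has any true effect.
    pheno = [0.2 * c + rng.gauss(0.0, 1.0) for c in centers]
    pheno_raw = center_within_groups(pheno, one_group)
    pheno_adj = center_within_groups(pheno, centers)
    raw_total = adj_total = 0.0
    for _ in range(n_snps):
        # Allele frequencies differ among centers (the source of stratification).
        freqs = [rng.uniform(0.2, 0.8) for _ in range(n_centers)]
        # Genotype = number of copies of the allele (two Bernoulli draws).
        geno = [(rng.random() < freqs[c]) + (rng.random() < freqs[c])
                for c in centers]
        raw_total += squared_t(center_within_groups(geno, one_group),
                               pheno_raw, n - 2)
        adj_total += squared_t(center_within_groups(geno, centers),
                               pheno_adj, n - n_centers - 1)
    return raw_total / n_snps, adj_total / n_snps
```

Calling `simulate()` returns a pair of averages. Under no association the statistic should average close to 1; in this simulated setting the unadjusted average is expected to be clearly inflated, while the center-adjusted average should return to roughly 1, mirroring the shift from the red to the blue crosses in Figure 3.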

