A phylogeny-based sampling strategy and power calculator informs genome-wide associations study design for microbial pathogens.

Farhat MR, Shapiro BJ, Sheppard SK, Colijn C, Murray M - Genome Med (2014)

Bottom Line: Here we consider general methodological questions related to sampling and analysis focusing on clonal to moderately recombining pathogens. We propose that a matched sampling scheme constitutes an efficient study design, and provide a power calculator based on phylogenetic convergence. We demonstrate this approach by applying it to genomic datasets for two microbial pathogens: Mycobacterium tuberculosis and Campylobacter species.

Affiliation: Department of Pulmonary and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA USA ; Department of Global Health and Social Medicine, Harvard Medical School, 641 Huntington Avenue Suite 4A, Boston, MA 02115 USA.

ABSTRACT
Whole genome sequencing is increasingly used to study phenotypic variation among infectious pathogens and to evaluate their relative transmissibility, virulence, and immunogenicity. To date, relatively little has been published on how, and how many, pathogen strains should be selected for studies associating phenotype and genotype. Identifying genetic associations in bacteria poses specific challenges because bacteria often comprise highly structured populations. Here we consider general methodological questions related to sampling and analysis, focusing on clonal to moderately recombining pathogens. We propose that a matched sampling scheme constitutes an efficient study design, and provide a power calculator based on phylogenetic convergence. We demonstrate this approach by applying it to genomic datasets for two microbial pathogens: Mycobacterium tuberculosis and Campylobacter species.

Figure 3: Power of the matched convergence test for identifying nucleotide sites associated with a phenotype of interest. The average genetic distance between matched strains was set to an intermediate level of s = 100 mutations. Colors represent increasing values of the site effect size f_site.

Mentions: As observed for rpoB, more than one nucleotide site in a locus can carry a fitness-conferring variant; we can thus formulate a locus-level test by defining a distribution for the sum of the variant counts in a locus, l_i_locus. If locus i of length g_i is not under selection, then, with the same parameters s and k defined above, the distribution of l_i_locus can be approximated by a Poisson distribution with rate n s g_i/k. Under the alternative hypothesis, the locus is under selection and the expected number of mutations is n f_locus, which is larger than n s g_i/k. Like f_site, f_locus is related to the collective fitness advantage conferred by the variants in the locus. For example, in the study cited above, f_locus was estimated to be 0.30 and 1.5 per locus per ph+ strain for the thyA locus (MTB p-aminosalicylic acid resistance) and the rpoB locus (RIF resistance), respectively [15]. The power of the test therefore depends on the value of f_site or f_locus. Because this analysis involves testing all the sites and loci with observed variation, a correction for multiple testing is needed. We use the Bonferroni correction, assuming that the upper limit for the number of variable sites across the sample is n s, and the number of variable loci to be 1 - e^(-n g_i s/k) (from the Poisson distribution). In Figures 3, 4, and 5, we provide power calculation results as a function of n, s and f, using the 4.41 Mbp MTB genome as an example. Here we calculated the expected power by integrating across the distribution of locus lengths g_i for the MTB reference genome H37Rv. Based on previous data from fingerprint-matched MTB, our power calculations explored a range of between-strain genetic distances (s) from 50 to 300 mutations [4].
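The locus-level calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' power calculator: it assumes n is the number of matched strain pairs and k is the genome length in sites (neither is defined in this excerpt), it uses a one-sided exact Poisson test, and it reads the expression 1 - e^(-n g_i s/k) as a per-locus probability of observing variation that is summed over loci to give the number of Bonferroni tests. The function names and the 4,000 loci of 1,000 bp each are hypothetical placeholders, not H37Rv annotation data.

```python
import math


def poisson_sf(c, lam):
    """Return P(X >= c) for X ~ Poisson(lam), computed as 1 - P(X <= c - 1)."""
    if c <= 0:
        return 1.0
    term = math.exp(-lam)  # pmf at x = 0
    cdf = 0.0
    for x in range(c):
        cdf += term
        term *= lam / (x + 1)  # pmf(x + 1) from pmf(x)
    return max(0.0, 1.0 - cdf)


def critical_count(lam0, alpha):
    """Smallest count c such that P(X >= c) <= alpha under the null rate lam0."""
    c = 0
    while poisson_sf(c, lam0) > alpha:
        c += 1
    return c


def expected_variable_loci(locus_lengths, n, s, k):
    """Sum over loci of 1 - exp(-n * g_i * s / k).

    Assumption: the per-locus expression in the text is the probability
    that a locus shows at least one variant; the sum is used as the
    number of locus-level tests.
    """
    return sum(1.0 - math.exp(-n * g * s / k) for g in locus_lengths)


def locus_power(n, s, g_i, k, f_locus, n_tests, alpha=0.05):
    """Power of the locus-level test at a Bonferroni-corrected level."""
    lam0 = n * s * g_i / k  # null rate: no selection on the locus
    lam1 = n * f_locus      # alternative rate: locus under selection
    c = critical_count(lam0, alpha / n_tests)
    return poisson_sf(c, lam1)


if __name__ == "__main__":
    k = 4_410_000           # 4.41 Mbp MTB genome, taken as the number of sites
    loci = [1000] * 4000    # hypothetical locus lengths, for illustration only
    for n in (20, 50, 100):
        tests = expected_variable_loci(loci, n=n, s=100, k=k)
        power = locus_power(n, s=100, g_i=1000, k=k, f_locus=0.30, n_tests=tests)
        print(f"n={n:4d}  tests={tests:7.1f}  power={power:.3f}")
```

Because the Poisson distribution is discrete, the achieved significance level is at or below the corrected threshold, so power computed this way rises in steps rather than smoothly as n grows. Replacing the hypothetical locus lengths with the real distribution of g_i and averaging locus_power over it would approximate the integration over locus lengths that the text describes.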

