A phylogeny-based sampling strategy and power calculator informs genome-wide associations study design for microbial pathogens.

Farhat MR, Shapiro BJ, Sheppard SK, Colijn C, Murray M - Genome Med (2014)

Bottom Line: Here we consider general methodological questions related to sampling and analysis focusing on clonal to moderately recombining pathogens. We propose that a matched sampling scheme constitutes an efficient study design, and provide a power calculator based on phylogenetic convergence. We demonstrate this approach by applying it to genomic datasets for two microbial pathogens: Mycobacterium tuberculosis and Campylobacter species.


Affiliation: Department of Pulmonary and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Department of Global Health and Social Medicine, Harvard Medical School, 641 Huntington Avenue Suite 4A, Boston, MA 02115, USA.

ABSTRACT
Whole genome sequencing is increasingly used to study phenotypic variation among infectious pathogens and to evaluate their relative transmissibility, virulence, and immunogenicity. To date, relatively little has been published on how, and how many, pathogen strains should be selected for studies associating phenotype and genotype. There are specific challenges when identifying genetic associations in bacteria, which often comprise highly structured populations. Here we consider general methodological questions related to sampling and analysis focusing on clonal to moderately recombining pathogens. We propose that a matched sampling scheme constitutes an efficient study design, and provide a power calculator based on phylogenetic convergence. We demonstrate this approach by applying it to genomic datasets for two microbial pathogens: Mycobacterium tuberculosis and Campylobacter species.


Figure 6: Phylogeny of MTB strains chosen for genotype-phenotype analysis. Dots indicate the presence of the drug-resistant phenotype. The tree demonstrates the matching of strains with and without the drug resistance phenotype.

Mentions: We applied the sampling strategy described in Figure 2 to a set of 123 clinically isolated, unmatched MTB genomes previously analyzed using phylogenetic convergence [15] (Additional files 3 and 4). Repetitive, transposon, and phage-related regions were removed as putatively recombinant or error-prone regions of the alignment. Of the 123 strains, 47 were resistant to one or more drugs (ph+) and the remaining 76 were sensitive (ph-). Because different fingerprinting methods had been used for the different strains in this study, and for demonstration purposes, we matched strains using the phylogeny constructed from whole-genome single nucleotide polymorphisms (SNPs). We chose eight pairs of strains using this selection strategy (Figure 6). We then counted the recent mutational changes (SNPs) between each pair of strains. The average distance (s) between pairs was 109 SNPs, with a range of 12 to 254 SNPs. We calculated the number of changes per gene across the eight pairs and compared this number to a Poisson distribution of mutations randomly distributed across branches, used as the null distribution. We then identified the tail of the distribution, which contains genes with a high number of changes that are highly associated with drug resistance (Figure 7). Overall, 12 genes and non-coding regions were found to be associated with drug resistance using only 16 of the 123 strains (13%) used in the original analysis. The analysis identified katG, embB, and rpoB (well-known drug resistance determinants) as well as top new candidates from the previous full analysis of all 123 genomes: ponA1, ppsA, murD, and rbsK. This selection strategy and analysis recovered 67% of the candidates identified with the full analysis while using only 13% of the data, demonstrating the superior power of the matched convergence analysis over the general unmatched test.
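
The per-gene comparison against a Poisson null described above can be illustrated with a short Python sketch. This is not the authors' implementation: it assumes that, under the null, changes fall on a gene in proportion to its length, and the function names (`poisson_sf`, `convergence_pvalues`), the gene labels, the gene lengths, the per-gene counts, and the round genome length of 4.4 Mb are all hypothetical values chosen for illustration. Only the eight pairs, the average distance of 109 SNPs, and the 16 of 123 strains come from the text.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), via 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def convergence_pvalues(changes_per_gene, gene_lengths, total_changes, genome_length):
    """Tail probability of each gene's observed change count.

    Assumed null: the total number of changes is scattered uniformly along
    the genome, so a gene's expected count is proportional to its length.
    """
    pvals = {}
    for gene, observed in changes_per_gene.items():
        expected = total_changes * gene_lengths[gene] / genome_length
        pvals[gene] = poisson_sf(observed, expected)
    return pvals


# From the text: eight matched pairs with an average distance of 109 SNPs.
n_pairs = 8
mean_distance = 109
total_changes = n_pairs * mean_distance  # approximate total changes across pairs

# Hypothetical inputs, for illustration only (not data from the study).
genome_length = 4_400_000                       # assumed alignment length in bp
gene_lengths = {"geneA": 2000, "geneB": 1500}   # made-up gene lengths in bp
changes_per_gene = {"geneA": 4, "geneB": 0}     # made-up counts across the pairs

pvals = convergence_pvalues(changes_per_gene, gene_lengths, total_changes, genome_length)

# Fraction of the original dataset used by the matched design (16 of 123 strains).
fraction_used = 2 * n_pairs / 123

for gene, p in sorted(pvals.items(), key=lambda item: item[1]):
    print(f"{gene}: P(X >= {changes_per_gene[gene]}) = {p:.3g}")
print(f"strains used: {fraction_used:.0%}")
```

Genes whose tail probability is very small sit in the tail of the null distribution and would be the candidates for association; in a real analysis the threshold would also need a correction for the number of genes tested.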

