Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons

View Article: PubMed Central - PubMed

ABSTRACT

Background: The identification of genomic biomarkers is a key step towards improving diagnostic tests and therapies. We present a reference-free method for this task that relies on a k-mer representation of genomes and a machine learning algorithm that produces intelligible models. The method is computationally scalable and well-suited for whole genome sequencing studies.

Results: The method was validated by generating models that predict the antibiotic resistance of C. difficile, M. tuberculosis, P. aeruginosa, and S. pneumoniae for 17 antibiotics. The obtained models are accurate, faithful to the biological pathways targeted by the antibiotics, and they provide insight into the process of resistance acquisition. Moreover, a theoretical analysis of the method revealed tight statistical guarantees on the accuracy of the obtained models, supporting its relevance for genomic biomarker discovery.

Conclusions: Our method allows the generation of accurate and interpretable predictive models of phenotypes, which rely on a small set of genomic variations. The method is not limited to predicting antibiotic resistance in bacteria and is applicable to a variety of organisms and phenotypes. Kover, an efficient implementation of our method, is open-source and should guide biological efforts to understand a plethora of phenotypes (http://github.com/aldro61/kover/).
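The k-mer representation mentioned in the Background can be illustrated with a minimal sketch. It assumes a genome is a plain string and that each genome is described by the presence or absence of every k-mer observed in the dataset. The function names, the toy sequences, and the choice of k = 3 are hypothetical and are not taken from Kover.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def presence_absence(genomes, k):
    """Describe each genome by a 0/1 vector over all k-mers seen in the dataset."""
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    # The vocabulary is built from the data alone, so no reference genome is needed.
    vocabulary = sorted(set().union(*per_genome.values()))
    matrix = {
        name: [int(kmer in found) for kmer in vocabulary]
        for name, found in per_genome.items()
    }
    return vocabulary, matrix


# Toy sequences invented for illustration only.
toy_genomes = {"g1": "ACGTAC", "g2": "ACGGAC"}
vocabulary, matrix = presence_absence(toy_genomes, k=3)
```

In this toy case the two genomes share only the k-mer ACG; every other column of the matrix separates them, which is the kind of signal a learning algorithm can then use.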

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2889-6) contains supplementary material, which is available to authorized users.



Fig. 3: Going beyond k-mers. This figure shows the location, on the katG gene, of each k-mer targeted by the isoniazid model (rule and equivalent rules). All the k-mers overlap a concise locus, suggesting that it contains a point mutation associated with the phenotype. A multiple sequence alignment revealed a high level of polymorphism at codon 315 (shown in red). The wild-type sequence (WT), as well as the resistance-conferring variants S315G, S315I, S315N and S315T, were observed. The rule in the model captures the absence of WT and thus includes the occurrence of all the observed variants.

Mentions: For M. tuberculosis, the isoniazid resistance model contains a single rule that targets the katG gene. This gene encodes the catalase-peroxidase enzyme (KatG), which is responsible for activating isoniazid, a prodrug, into its toxic form. As illustrated in Fig. 3, the k-mers associated with this rule and its equivalent rules all overlap a concise locus of katG, suggesting the occurrence of a point mutation. This locus contains codon 315 of KatG, where mutations S315I, S315G, S315N and S315T are all known to result in resistance [36, 37]. A multiple sequence alignment revealed that these variants were all present in the dataset. The SCM therefore selected a rule that captures the absence of the wild-type sequence at this locus, effectively including the presence of all the observed variants.
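A small sketch may help show why a single absence rule can stand in for several variants. It assumes a rule is simply a test on whether one k-mer occurs in a genome. The sequences below are invented toy strings, not real katG sequence, and the helper names are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def presence_rule(genome, kmer):
    """True when the k-mer occurs in the genome."""
    return kmer in kmers(genome, len(kmer))


def absence_rule(genome, kmer):
    """True when the k-mer does not occur in the genome."""
    return not presence_rule(genome, kmer)


# Invented toy locus: position 5 (0-based) plays the role of the polymorphic site.
toy_locus = {
    "WT": "GATCAGCGGC",
    "variant_a": "GATCACCGGC",
    "variant_b": "GATCAACGGC",
    "variant_c": "GATCATCGGC",
}

WT_KMER = "CAGCG"         # hypothetical k-mer spanning the wild-type site
VARIANT_A_KMER = "CACCG"  # hypothetical k-mer specific to variant_a

# One absence rule on the wild-type k-mer flags every variant at once.
by_absence = {name: absence_rule(seq, WT_KMER) for name, seq in toy_locus.items()}

# A presence rule on a single variant k-mer flags only that variant.
by_presence = {name: presence_rule(seq, VARIANT_A_KMER) for name, seq in toy_locus.items()}
```

Here `by_absence` marks all three toy variants and leaves the wild type unmarked, whereas `by_presence` marks only `variant_a`. Covering the same variants with presence rules would take one rule per variant.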

