Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons


ABSTRACT

Background: The identification of genomic biomarkers is a key step towards improving diagnostic tests and therapies. We present a reference-free method for this task that relies on a k-mer representation of genomes and a machine learning algorithm that produces intelligible models. The method is computationally scalable and well-suited for whole genome sequencing studies.

Results: The method was validated by generating models that predict the antibiotic resistance of C. difficile, M. tuberculosis, P. aeruginosa, and S. pneumoniae for 17 antibiotics. The obtained models are accurate, faithful to the biological pathways targeted by the antibiotics, and they provide insight into the process of resistance acquisition. Moreover, a theoretical analysis of the method revealed tight statistical guarantees on the accuracy of the obtained models, supporting its relevance for genomic biomarker discovery.

Conclusions: Our method allows the generation of accurate and interpretable predictive models of phenotypes, which rely on a small set of genomic variations. The method is not limited to predicting antibiotic resistance in bacteria and is applicable to a variety of organisms and phenotypes. Kover, an efficient implementation of our method, is open-source and should guide biological efforts to understand a plethora of phenotypes (http://github.com/aldro61/kover/).

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2889-6) contains supplementary material, which is available to authorized users.


Fig. 4: Overcoming spurious correlations. This figure shows how spurious correlations in the M. tuberculosis data affect the models produced by the Set Covering Machine (SCM). (a) For each M. tuberculosis dataset, the proportion of isolates that are identically labeled in each other dataset is shown. This proportion is calculated using Eq. (2). (b) The antibiotic resistance models learned by the SCM at each iteration of the correlation removal procedure. Each model is represented by a rounded rectangle identified by the round number and the estimated error rate. All the models are disjunctions (logical-OR). The circular nodes correspond to k-mer rules. A single border indicates a presence rule and a double border indicates an absence rule. The numbers in the circles show the number of equivalent rules. A rule is connected to an antibiotic if it was included in its model. The weight of the edges gives the importance of each rule.
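The caption describes each model as a disjunction of k-mer presence and absence rules. The sketch below illustrates how such a model could be evaluated on a genome. It is not Kover's API: the function names `kmers` and `predict_disjunction`, the `(k-mer, must_be_present)` rule encoding, the toy sequences, and the convention that `True` means the positive phenotype are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical rule encoding: (k-mer, True) is a presence rule,
# (k-mer, False) is an absence rule.
Rule = Tuple[str, bool]


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def predict_disjunction(genome_kmers: Set[str], rules: List[Rule]) -> bool:
    """Logical-OR of the rules: positive if at least one rule is satisfied."""
    return any(
        (kmer in genome_kmers) == must_be_present
        for kmer, must_be_present in rules
    )


# Toy model: "GGGG is present" OR "TACG is absent".
model = [("GGGG", True), ("TACG", False)]

genome_1 = kmers("ACGTACGT", 4)  # contains TACG, lacks GGGG: no rule fires
genome_2 = kmers("ACGTTTTT", 4)  # lacks TACG: the absence rule fires

print(predict_disjunction(genome_1, model))
print(predict_disjunction(genome_2, model))
```

Because the model is a disjunction, a single satisfied rule is enough for a positive prediction, which is what keeps such models short and readable.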

Mentions: One notable example of such a situation is the strong correlation in resistance to antibiotics that do not share common mechanisms of action. These correlations might originate from treatment regimens. For instance, Fig. 4a shows, for M. tuberculosis, the proportion of isolates that are identically labeled (resistant or sensitive) for each pair of antibiotics. More formally, this figure shows a matrix C, where each entry $C_{ij}$ corresponds to a pair of datasets $(\mathcal{S}_{i}, \mathcal{S}_{j})$, and

$$ C_{ij} \overset{\text{def}}{=} \frac{|\{(\mathbf{x}, y) \in \mathcal{S}_{i} : (\mathbf{x}, y) \in \mathcal{S}_{j}\}|}{|\mathcal{S}_{i}|}. \tag{2} $$
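As a rough illustration of Eq. (2), the sketch below computes the matrix C for a list of labeled datasets. It assumes that each dataset is a non-empty dictionary mapping an isolate identifier (standing in for the genome x) to its label y; this representation, the function name `label_agreement_matrix`, and the toy data are hypothetical and do not come from the article.

```python
from typing import Dict, List


def label_agreement_matrix(datasets: List[Dict[str, int]]) -> List[List[float]]:
    """C[i][j] = |{(x, y) in S_i : (x, y) in S_j}| / |S_i|.

    Each dataset maps an isolate identifier to its label
    (assumed here: 1 = resistant, 0 = sensitive). Datasets must be non-empty.
    """
    matrix = []
    for s_i in datasets:
        row = []
        for s_j in datasets:
            # Count isolates of S_i that appear in S_j with the same label.
            # An isolate missing from S_j yields None and never matches.
            shared = sum(
                1 for isolate, label in s_i.items()
                if s_j.get(isolate) == label
            )
            row.append(shared / len(s_i))
        matrix.append(row)
    return matrix


# Toy data for two hypothetical antibiotics.
antibiotic_a = {"iso1": 1, "iso2": 1, "iso3": 0, "iso4": 0}
antibiotic_b = {"iso1": 1, "iso2": 0, "iso3": 0}

C = label_agreement_matrix([antibiotic_a, antibiotic_b])
for row in C:
    print(row)
```

Because each entry is normalized by the size of the first dataset of the pair, C is not symmetric in general: in the toy data, C[0][1] is 2/4 while C[1][0] is 2/3.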

