Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons
ABSTRACT
Background: The identification of genomic biomarkers is a key step towards improving diagnostic tests and therapies. We present a reference-free method for this task that relies on a k-mer representation of genomes and a machine learning algorithm that produces intelligible models. The method is computationally scalable and well-suited for whole genome sequencing studies.
Results: The method was validated by generating models that predict the antibiotic resistance of C. difficile, M. tuberculosis, P. aeruginosa, and S. pneumoniae for 17 antibiotics. The obtained models are accurate, faithful to the biological pathways targeted by the antibiotics, and they provide insight into the process of resistance acquisition. Moreover, a theoretical analysis of the method revealed tight statistical guarantees on the accuracy of the obtained models, supporting its relevance for genomic biomarker discovery.
Conclusions: Our method allows the generation of accurate and interpretable predictive models of phenotypes, which rely on a small set of genomic variations. The method is not limited to predicting antibiotic resistance in bacteria and is applicable to a variety of organisms and phenotypes. Kover, an efficient implementation of our method, is open-source and should guide biological efforts to understand a plethora of phenotypes (http://github.com/aldro61/kover/).
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2889-6) contains supplementary material, which is available to authorized users.
We represent each genome by the presence or absence of each possible k-mer. There are $4^k$ possible k-mers and hence, for $k=31$, we consider $4^{31} > 4 \cdot 10^{18}$ k-mers. Let $\mathcal{K}$ be the set of all, possibly overlapping, k-mers present in at least one genome of the training set $\mathcal{S}$. Observe that $\mathcal{K}$ omits k-mers that are absent in $\mathcal{S}$ and thus non-discriminatory, which allows the SCM to efficiently work in this enormous feature space. Then, for each genome $x$, let $\phi(x) \in \{0,1\}^{|\mathcal{K}|}$ be a $|\mathcal{K}|$-dimensional vector, such that its component $\phi_i(x)=1$ if the $i$-th k-mer of $\mathcal{K}$ is present in $x$ and 0 otherwise. An example of this representation is given in Fig. 5. We consider two types of boolean-valued rules: presence rules and absence rules, which rely on the vectors $\phi(x)$ to determine their outcome. For each k-mer $k_i \in \mathcal{K}$, we define a presence rule as $p_{k_i}(\phi(x)) = I[\phi_i(x)=1]$ and an absence rule as $a_{k_i}(\phi(x)) = I[\phi_i(x)=0]$, where $I[a]=1$ if $a$ is true and $I[a]=0$ otherwise. The SCM, which is detailed in Additional file 1: Appendix 1, can then be applied by using $\{(\phi(x_1),y_1),\ldots,(\phi(x_m),y_m)\}$ as the set of learning examples and by using the set of presence/absence rules defined above as the set of boolean-valued rules. This yields a phenotypic model which explicitly highlights the importance of a small set of k-mers. In addition, this model has a form which is simple to interpret, since its predictions are the result of a simple logical operation.
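The representation described above can be illustrated with a small Python sketch. This is a toy illustration only, not Kover's implementation: the function names (`kmers`, `build_kmer_index`, `phi`, `presence_rule`, `absence_rule`), the two short example genomes, and the choice k=3 are all assumptions made for readability. It builds the set of k-mers observed in a training set, encodes each genome as a binary vector ϕ(x), and defines presence and absence rules over that vector.

```python
from typing import Callable, List, Set


def kmers(genome: str, k: int) -> Set[str]:
    """Return the set of all (possibly overlapping) k-mers in a genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def build_kmer_index(genomes: List[str], k: int) -> List[str]:
    """List the k-mers present in at least one training genome.

    Sorting is an arbitrary choice (an assumption of this sketch) that
    gives each k-mer a stable position i in the feature vector.
    """
    present: Set[str] = set()
    for genome in genomes:
        present |= kmers(genome, k)
    return sorted(present)


def phi(genome: str, index: List[str], k: int) -> List[int]:
    """Binary vector: component i is 1 if the i-th indexed k-mer is in the genome."""
    found = kmers(genome, k)
    return [1 if kmer in found else 0 for kmer in index]


def presence_rule(i: int) -> Callable[[List[int]], int]:
    """Rule that outputs 1 when the i-th k-mer is present."""
    return lambda vector: int(vector[i] == 1)


def absence_rule(i: int) -> Callable[[List[int]], int]:
    """Rule that outputs 1 when the i-th k-mer is absent."""
    return lambda vector: int(vector[i] == 0)


# Hypothetical toy training set; real studies use whole genomes and k=31.
training = ["ACGTAC", "ACGGAC"]
k = 3
index = build_kmer_index(training, k)
vectors = [phi(genome, index, k) for genome in training]

# The rule for k-mer "CGT" separates the two toy genomes.
cgt = index.index("CGT")
has_cgt = presence_rule(cgt)
lacks_cgt = absence_rule(cgt)
print(index)
print(vectors)
print(has_cgt(vectors[0]), lacks_cgt(vectors[1]))

# Size of the full feature space for k=31, as quoted in the text.
print(4 ** 31 > 4 * 10 ** 18)
```

Because only k-mers observed in the training set are indexed, the vectors have one component per observed k-mer rather than one per possible k-mer, which is what keeps the feature space tractable for k=31.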