MLgsc: A Maximum-Likelihood General Sequence Classifier.

Junier T, Hervé V, Wunderlin T, Junier P - PLoS ONE (2015)

Bottom Line: The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The package has minimal dependencies and thus can be easily integrated in command-line based classification pipelines.


Affiliation: Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland; Vital-IT Group, Swiss Institute of Bioinformatics, Lausanne, Vaud, Switzerland.

ABSTRACT
We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The latter is used to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software was shown to achieve an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated in command-line based classification pipelines.



Fig 2. Classifying. The aligned query sequence (a) is first scored by the matrices at the root’s direct children nodes, in this example the matrix for taxa 1 and 2 (b) and the one for taxa 3 and 4 (c). Matrix (b) is found to yield the better score (solid arrow). Therefore, the query is now scored against matrix (b)’s children, namely taxa 1 and 2. The former is found to yield the better score, and OTU 1 is reported as the most likely (d). The shaded parts of the tree (matrices for taxa 3 and 4) are never tested. In a balanced, fully bifurcating tree of n nodes, only 2 log2(n) matrices are tested.
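
The descent described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not MLgsc's own code: the `Node` class, the `build_pwm`, `score` and `classify` functions, the toy four-letter sequences, and the `FLOOR` weight for residues unseen in a column are all hypothetical. The sketch also reads n as the number of leaves, which is an interpretation on our part; under that reading a balanced bifurcating tree has log2(n) levels below the root and two matrices are scored per level.

```python
import math

# Hypothetical log10 weight for a residue never seen in a column (assumption).
FLOOR = -2.0

def build_pwm(seqs):
    """Column-wise log10 residue frequencies of equal-length aligned sequences."""
    return [
        {res: math.log10(col.count(res) / len(col)) for res in set(col)}
        for col in zip(*seqs)
    ]

def score(pwm, aligned_query):
    """Log10 likelihood of the aligned query, treating columns as independent."""
    return sum(col.get(res, FLOOR) for col, res in zip(pwm, aligned_query))

class Node:
    def __init__(self, name, seqs=(), children=()):
        self.name = name
        self.children = list(children)
        # An inner node's matrix is built from all sequences below it.
        self.seqs = list(seqs) or [s for c in self.children for s in c.seqs]
        self.pwm = build_pwm(self.seqs)

def classify(root, aligned_query):
    """Descend from the root, always following the best-scoring child."""
    node, path, tested = root, [], 0
    while node.children:
        tested += len(node.children)  # every child matrix at this level is scored
        node = max(node.children, key=lambda c: score(c.pwm, aligned_query))
        path.append(node.name)
    return node.name, path, tested

# Toy tree mirroring the figure: (b) covers taxa 1 and 2, (c) covers taxa 3 and 4.
taxa = [Node("taxon1", ["ACGT"]), Node("taxon2", ["ACGA"]),
        Node("taxon3", ["TTGT"]), Node("taxon4", ["TTCA"])]
b = Node("b", children=taxa[:2])
c = Node("c", children=taxa[2:])
root = Node("root", children=[b, c])

leaf, path, tested = classify(root, "ACGT")
print(leaf, path, tested)  # taxon1 ['b', 'taxon1'] 4
```

With four leaves, four matrices are scored (b, c, taxon 1, taxon 2), which matches 2 log2(4) = 4; the matrices for taxa 3 and 4 are never touched.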


Mentions: The query sequence is first aligned to the PWM at the root of the tree, using semi-global alignment (local in the query and global in the PWM). The method uses dynamic programming similar to Needleman and Wunsch’s algorithm [20], but with match/mismatch scores based on weights in the PWM instead of being fixed, and gaps disallowed in the PWM. The aligned query is thus exactly as long as the PWM. The aligned sequence is then scored against the matrices at the children nodes of the root. The matrix yielding the highest score is selected, and the sequence is now scored against the children of this matrix, and so on recursively until the highest-scoring matrix is at a leaf (Fig 2). The corresponding reference is reported as the most likely, together with the path through the tree. At each node, the logarithm base 10 of the evidence ratio between the best-scoring and next-best scoring PWM is shown, which serves as a measure of confidence in the decision taken regarding classification at each step. In this case, the evidence ratio is simply the ratio of likelihoods between two PWMs.
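
The alignment step and the confidence measure described above can be sketched as follows. This is an illustrative reading of the paragraph, not the package's implementation. It assumes that a PWM is a list of per-column dictionaries of log10 weights, that a gap in the query costs a fixed `GAP`, and that an unseen residue scores `FLOOR`; both constants and all function names are hypothetical. Leading and trailing query residues are skipped for free (local in the query), every PWM column is consumed (global in the PWM), and no query residue is ever placed against a gap in the PWM, so the aligned query is exactly as long as the PWM.

```python
GAP = -2.0    # hypothetical log10 penalty for a gap in the query (assumption)
FLOOR = -2.0  # hypothetical log10 weight for a residue unseen in a column

def align_to_pwm(pwm, query):
    """Return (score, aligned_query); aligned_query has len(pwm) symbols."""
    m, n = len(pwm), len(query)
    neg = float("-inf")
    # best[i][j]: best score for the first i columns, last query residue used is j.
    # Row 0 is all zeros: skipping a query prefix costs nothing.
    best = [[0.0] * (n + 1)] + [[neg] * (n + 1) for _ in range(m)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(n + 1):
            gap = best[i - 1][j] + GAP  # column i aligned to a gap in the query
            match = neg
            if j > 0:  # column i aligned to query residue j
                match = best[i - 1][j - 1] + pwm[i - 1].get(query[j - 1], FLOOR)
            if match >= gap:
                best[i][j], back[i][j] = match, "M"
            else:
                best[i][j], back[i][j] = gap, "G"
    # Skipping a query suffix is free too: take the best cell of the last row.
    j = max(range(n + 1), key=lambda k: best[m][k])
    total = best[m][j]
    aligned = []
    for i in range(m, 0, -1):  # trace back one PWM column at a time
        if back[i][j] == "M":
            aligned.append(query[j - 1])
            j -= 1
        else:
            aligned.append("-")
    return total, "".join(reversed(aligned))

def log10_evidence_ratio(log10_scores):
    """log10(L_best / L_next): a difference when scores are log10 likelihoods."""
    top, runner_up = sorted(log10_scores, reverse=True)[:2]
    return top - runner_up

# Toy PWM that expects exactly A, C, G, T (log10 weight 0 = probability 1).
pwm = [{"A": 0.0}, {"C": 0.0}, {"G": 0.0}, {"T": 0.0}]
print(align_to_pwm(pwm, "TTACGTGG"))  # (0.0, 'ACGT')
print(align_to_pwm(pwm, "TTACTGG"))   # (-2.0, 'AC-T')
```

Because the scores are log10 likelihoods, the log10 evidence ratio between the best and next-best child is simply the difference of their scores; a value of 3, for example, means the best matrix is 1000 times more likely than the runner-up.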

