MLgsc: A Maximum-Likelihood General Sequence Classifier.

Junier T, Hervé V, Wunderlin T, Junier P - PLoS ONE (2015)

Bottom Line: The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The package has minimal dependencies and thus can be easily integrated into command-line-based classification pipelines.


Affiliation: Laboratory of Microbiology, University of Neuchâtel, Neuchâtel, Neuchâtel, Switzerland; Vital-IT Group, Swiss Institute of Bioinformatics, Lausanne, Vaud, Switzerland.

ABSTRACT
We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The tree is used both to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software achieved an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated into command-line-based classification pipelines.



Fig 1. Training the Model. A tree of position-specific weight matrices (bottom) constructed from a phylogeny (top left) and a multiple alignment (top right). Each taxon in the tree is represented by at least one (preferably more) sequence(s) in the alignment. Column i in a taxon matrix contains the relative frequencies of residues at position i, computed over all sequences of that taxon. Matrices at inner nodes are averages of the matrices of the node’s children. The tree need not be bifurcating, but a fully bifurcating tree offers the best speed performance (see Discussion).

Mentions: MLgsc constructs a tree of position-specific weight matrices (PWMs) using a multiple alignment of sequences from the classifying region and a phylogenetic tree of the reference taxa (Fig 1). The tree of PWMs has the same topology as the phylogeny of the taxa, and a matrix in the PWM tree models a clade in the phylogeny: a matrix at a tip of the tree models a single taxon, while a matrix in an internal tree node models a clade of two or more taxa. Column i in a matrix contains the relative frequencies of residues at position i, computed over all reference sequences belonging to the clade modeled by the matrix. It therefore represents a maximum-likelihood estimate of the true frequencies in the wild. Gaps in the alignment are simply considered an additional character, so the matrix models the probability of occurrence of a deletion with respect to the alignment.
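The construction described above can be illustrated with a minimal Python sketch. It is not MLgsc's actual code: the function names (`column_frequencies`, `build_pwm_tree`), the nested-tuple tree representation, and the fixed nucleotide alphabet are all assumptions made for illustration. Each node's matrix is computed by pooling all reference sequences of its clade, as the paragraph above describes, and the gap character is counted like any other residue.

```python
from collections import Counter

# Hypothetical alphabet for this sketch; the gap "-" is treated as one more character.
ALPHABET = "ACGT-"


def column_frequencies(seqs, alphabet=ALPHABET):
    """Return a PWM as a list of dicts (one per alignment column),
    mapping each residue to its relative frequency in that column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # Counter returns 0 for residues that never occur in the column.
        pwm.append({r: counts[r] / len(seqs) for r in alphabet})
    return pwm


def build_pwm_tree(tree, alignment):
    """Build a tree of PWMs with the same topology as `tree`.

    tree:      a taxon name (tip) or a tuple of subtrees (inner node).
    alignment: dict mapping taxon name -> list of aligned sequences.
    Each returned node is a dict with keys "pwm", "seqs" and "children".
    """
    if isinstance(tree, str):
        # Tip: the matrix models a single taxon.
        seqs = list(alignment[tree])
        children = []
    else:
        # Inner node: the matrix models the clade, pooling all its sequences.
        children = [build_pwm_tree(child, alignment) for child in tree]
        seqs = [s for child in children for s in child["seqs"]]
    return {"pwm": column_frequencies(seqs), "seqs": seqs, "children": children}


if __name__ == "__main__":
    # Toy data: three taxa, four alignment columns.
    alignment = {"A": ["ACG-", "ACGT"], "B": ["ATG-"], "C": ["TTGA"]}
    root = build_pwm_tree((("A", "B"), "C"), alignment)
    print(root["pwm"][0])
```

Pooling the sequences of a clade gives a weighted average of the children's matrices, weighted by the number of sequences in each child. The Fig 1 caption describes inner matrices as averages of the children's matrices; if an unweighted average is intended, the inner-node step would instead average the child PWMs column by column.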

