Ultra-large alignments using phylogeny-aware profiles.

Nguyen NP, Mirarab S, Kumar K, Warnow T - Genome Biol. (2015)

Bottom Line: Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.


Affiliation: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA. namphuon@illinois.edu.

ABSTRACT
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .


Fig. 2: Overview of the UPP algorithm. The input is a set of unaligned sequences. This sequence dataset is split into two parts: the backbone dataset and the set of query sequences. An alignment and tree are estimated for the backbone dataset, and an ensemble of HMMs is constructed based on the backbone alignment and tree. The query sequences are then aligned to each HMM, and the best-scoring HMM for each sequence is used to add the query sequence to the backbone alignment. See text for more details.

UPP uses the HMMER [13] suite of tools (see “Materials and methods”) to produce an alignment, and builds on ideas in SEPP [14]. The basic idea behind UPP is to estimate an accurate alignment for a subset of the sequences and then align the remaining sequences to that alignment using profile HMMs [15]. UPP has four phases (see Fig. 2).
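The selection step described above (score each query against every model in the ensemble, then keep the best-scoring model) can be illustrated with a minimal Python sketch. This is not UPP's implementation and does not call HMMER: the per-column frequency profile below is a toy stand-in for a profile HMM (no insert or delete states), the two backbone subsets are written by hand rather than derived from a backbone tree, and every name (`build_profile`, `score`, `split_backbone`, `best_profile`, `subset_a`, `subset_b`) is hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: ungapped nucleotide toy data


def build_profile(aligned_rows, pseudocount=1.0):
    """Per-column log-probabilities from equal-length rows.

    Toy stand-in for a profile HMM: match states only, no gaps.
    """
    total = len(aligned_rows) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(aligned_rows[0])):
        counts = Counter(row[col] for row in aligned_rows)
        profile.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}
        )
    return profile


def score(profile, query):
    """Log-likelihood of a query of the same length under the profile."""
    return sum(column[ch] for column, ch in zip(profile, query))


def split_backbone(names, backbone_size, seed=0):
    """Randomly split sequence names into a backbone set and a query list."""
    rng = random.Random(seed)
    backbone = set(rng.sample(sorted(names), backbone_size))
    queries = [n for n in names if n not in backbone]
    return backbone, queries


def best_profile(profiles, query):
    """Return the name of the ensemble member that scores the query highest."""
    scores = {name: score(p, query) for name, p in profiles.items()}
    return max(scores, key=scores.get)


# Hand-made "ensemble": one profile per subset of a hypothetical backbone alignment.
profiles = {
    "subset_a": build_profile(["AAAA", "AAAT"]),
    "subset_b": build_profile(["GGGG", "GGGC"]),
}

for query in ["AAAA", "GGGC"]:
    print(query, "->", best_profile(profiles, query))
```

In the real method the chosen model would also supply the alignment of the query to the backbone; the sketch stops at model selection to keep the assumptions small.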

