Ultra-large alignments using phylogeny-aware profiles.

Nguyen NP, Mirarab S, Kumar K, Warnow T - Genome Biol. (2015)

Bottom Line: Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.


Affiliation: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA. namphuon@illinois.edu.

ABSTRACT
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .



Fig. 5: Comparison for the RNASim 200K dataset. We show (a) alignment SP-error, (b) FN tree error, and (c) ΔFN tree error rates for RNASim datasets with up to 200K sequences. Results not shown are due to methods failing to return an alignment within the 24-hour time period on TACC using 12 cores. ML trees were estimated using FastTree under the general time reversible model.


Mentions: We evaluated the ability of different methods to analyze very large datasets (up to one million sequences), using subsets of the million-sequence RNASim dataset; this comparison also reveals the impact of taxon sampling on the alignment methods. We examined performance for UPP (Fast), the fast version of UPP that differs from the default setting of UPP only in that it uses smaller backbones (100 sequences instead of 1000). Figure 5 shows results for 10,000 to 200,000 sequences, and compares UPP (Fast), UPP (Default), PASTA, MAFFT, Muscle, and Clustal-Omega, limiting analyses to 24 hours on a 12-core machine with 24 GB of memory. While all methods shown were able to complete analyses for the 10K dataset, only UPP (Fast) and PASTA completed analyses for the 100K and 200K datasets.
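The error measures reported for this comparison (alignment SP-error, FN tree error, and ΔFN) are not defined on this page. The sketch below is a minimal illustration under commonly used definitions, which are assumptions here: SP-error as the mean of the sum-of-pairs false negative and false positive rates, FN as the fraction of true-tree bipartitions missing from an estimated tree, and ΔFN as the FN of the tree built on the estimated alignment minus the FN of the tree built on the true alignment. The function names and the dictionary and set inputs are hypothetical and are not part of UPP or any other tool named above.

```python
from itertools import combinations


def homologous_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps sequence name -> gapped row ('-' marks a gap).
    A residue is identified by (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        # Names are visited in sorted order, so each pair has a canonical orientation.
        pairs.update(combinations(residues, 2))
    return pairs


def sp_error(reference, estimated):
    """Mean of SPFN and SPFP (assumed definition of SP-error)."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref) if ref else 0.0
    spfp = len(est - ref) / len(est) if est else 0.0
    return (spfn + spfp) / 2


def fn_rate(true_bipartitions, estimated_bipartitions):
    """Fraction of true-tree bipartitions missing from the estimated tree."""
    missing = true_bipartitions - estimated_bipartitions
    return len(missing) / len(true_bipartitions)


def delta_fn(true_bipartitions, tree_on_estimated_alignment, tree_on_true_alignment):
    """FN on the estimated alignment minus FN on the true alignment."""
    return (fn_rate(true_bipartitions, tree_on_estimated_alignment)
            - fn_rate(true_bipartitions, tree_on_true_alignment))


# Toy inputs, invented purely for illustration.
reference = {"s1": "AC-GT", "s2": "ACTGT"}
estimated = {"s1": "ACG-T", "s2": "ACTGT"}
toy_sp_error = sp_error(reference, estimated)

true_tree = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
tree_est_aln = {frozenset({"A", "B"})}
tree_true_aln = set(true_tree)
toy_delta_fn = delta_fn(true_tree, tree_est_aln, tree_true_aln)
```

Representing a tree as a set of bipartitions keeps the sketch free of third-party dependencies; in practice the bipartitions would be extracted from Newick trees with a phylogenetics library.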

