Limits...
Ultra-large alignments using phylogeny-aware profiles.

Nguyen NP, Mirarab S, Kumar K, Warnow T - Genome Biol. (2015)

Bottom Line: Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets.However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences.UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.

View Article: PubMed Central - PubMed

Affiliation: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA. namphuon@illinois.edu.

ABSTRACT
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .

No MeSH data available.


Running time for UPP (Fast) for the RNASim datasets. We show the running time to generate an alignment for UPP (Fast) for RNASim datasets with 10K, 50K, 100K, and 200K sequences. All analyses were run on TACC with 24 GB of memory and 12 CPUs
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4492008&req=5

Fig6: Running time for UPP (Fast) for the RNASim datasets. We show the running time to generate an alignment for UPP (Fast) for RNASim datasets with 10K, 50K, 100K, and 200K sequences. All analyses were run on TACC with 24 GB of memory and 12 CPUs

Mentions: We then explored how UPP’s running time (measured using the wall-clock time) scaled with the size of the dataset by exploring subsets of the RNASim dataset with 10,000 to 200,000 sequences, using 12 cores. Running times for UPP (Fast) for the RNASim datasets showed a close to linear trend, so that UPP (Fast) completed for 10K sequences in 55 minutes, 50K sequences in 4.2 hours, 100K sequences in about 8.5 hours, and 200K sequences in about 17.8 hours (Fig. 6).Fig. 6


Ultra-large alignments using phylogeny-aware profiles.

Nguyen NP, Mirarab S, Kumar K, Warnow T - Genome Biol. (2015)

Running time for UPP (Fast) for the RNASim datasets. We show the running time to generate an alignment for UPP (Fast) for RNASim datasets with 10K, 50K, 100K, and 200K sequences. All analyses were run on TACC with 24 GB of memory and 12 CPUs
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4492008&req=5

Fig6: Running time for UPP (Fast) for the RNASim datasets. We show the running time to generate an alignment for UPP (Fast) for RNASim datasets with 10K, 50K, 100K, and 200K sequences. All analyses were run on TACC with 24 GB of memory and 12 CPUs
Mentions: We then explored how UPP’s running time (measured using the wall-clock time) scaled with the size of the dataset by exploring subsets of the RNASim dataset with 10,000 to 200,000 sequences, using 12 cores. Running times for UPP (Fast) for the RNASim datasets showed a close to linear trend, so that UPP (Fast) completed for 10K sequences in 55 minutes, 50K sequences in 4.2 hours, 100K sequences in about 8.5 hours, and 200K sequences in about 17.8 hours (Fig. 6).Fig. 6

Bottom Line: Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets.However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences.UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.

View Article: PubMed Central - PubMed

Affiliation: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA. namphuon@illinois.edu.

ABSTRACT
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .

No MeSH data available.