Ultra-large alignments using phylogeny-aware profiles.

Nguyen NP, Mirarab S, Kumar K, Warnow T - Genome Biol. (2015)

Bottom Line: Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.

Affiliation: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, 61801, Illinois, USA. namphuon@illinois.edu.

ABSTRACT
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp .


Fig. 4: Comparison of UPP variants on fragmentary datasets. We show (a) average alignment error and (b) ΔFN tree error for UPP (Default), UPP (Default, NoDecomp), UPP-random (Default), and UPP-random (Default, NoDecomp) for the fragmentary datasets. The backbone is not restricted to full-length sequences in UPP-random, and so it allows fragmentary sequences in the backbone set. UPP-random (Default, NoDecomp) failed to align at least one dataset from each of the RNASim 10K, Indelible 10K, and CRW model conditions. UPP (Default, NoDecomp) failed to align at least one dataset from each of the ROSE NT, RNASim 10K, and Indelible 10K model conditions. ML trees were estimated using FastTree under the general time reversible model.
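
The caption reports a ΔFN tree error without defining it in this excerpt. The sketch below assumes the usual reading: the false-negative (missing branch) rate of the tree estimated on the estimated alignment, minus the same rate for the tree estimated on the true alignment, both measured against the true tree. The function names and the representation of a tree as a set of bipartitions (each a frozenset of the taxa on one side) are hypothetical choices for illustration, not part of UPP.

```python
def fn_rate(true_splits, est_splits):
    """Fraction of true-tree bipartitions that are missing from the estimated tree."""
    if not true_splits:
        return 0.0
    return len(true_splits - est_splits) / len(true_splits)


def delta_fn(true_splits, splits_from_est_aln, splits_from_true_aln):
    """Assumed definition: FN rate on the estimated alignment minus FN rate on the true alignment."""
    return fn_rate(true_splits, splits_from_est_aln) - fn_rate(true_splits, splits_from_true_aln)


# Toy example with four internal branches in the true tree.
true_tree = {frozenset("AB"), frozenset("ABC"), frozenset("EF"), frozenset("EFG")}
tree_on_est_aln = {frozenset("AB"), frozenset("EF"), frozenset("CD"), frozenset("FG")}
tree_on_true_aln = {frozenset("AB"), frozenset("ABC"), frozenset("EF"), frozenset("FG")}

print(delta_fn(true_tree, tree_on_est_aln, tree_on_true_aln))
```

Under this assumed definition, a ΔFN near zero would mean the estimated alignment costs little tree accuracy relative to the true alignment.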

Mentions: To better understand why UPP is robust to fragmentation, we explored UPP variants (called UPP-random) in which we did not constrain the backbone to be only full-length sequences. We also looked at whether using the ensemble of HMMs instead of a single HMM contributes to robustness to fragmentation. These comparisons (Fig. 4) revealed some interesting trends about the impact of these algorithm design parameters. First, the only UPP variants that were able to align all the datasets were the two that used the ensemble of HMMs; the variants that used a single HMM each failed to align several datasets because HMMER was not able to align some of the query sequences to the backbone alignment (Fig. 4).
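
The difference between the default backbone and the UPP-random backbone can be sketched as follows. This is a minimal illustration, not the UPP implementation: the function `select_backbone`, its parameters, and the rule that "full-length" means within 25% of the median sequence length are all assumptions made here for the example.

```python
import random
from statistics import median


def select_backbone(seqs, size, full_length_only=True, tolerance=0.25, seed=0):
    """Pick backbone sequence names at random.

    seqs: dict mapping sequence name -> unaligned sequence string.
    full_length_only=True keeps only sequences whose length is within
    `tolerance` of the median length (an assumed stand-in for "full-length");
    full_length_only=False mimics UPP-random, where fragments may be chosen.
    """
    rng = random.Random(seed)
    names = sorted(seqs)
    if full_length_only:
        med = median(len(seqs[n]) for n in names)
        lo, hi = (1 - tolerance) * med, (1 + tolerance) * med
        names = [n for n in names if lo <= len(seqs[n]) <= hi]
    return rng.sample(names, min(size, len(names)))


# Toy dataset: five full-length sequences and three fragments.
seqs = {f"full{i}": "A" * 100 for i in range(5)}
seqs.update({f"frag{i}": "A" * 20 for i in range(3)})

restricted = select_backbone(seqs, 3)                      # default-style backbone
unrestricted = select_backbone(seqs, 8, full_length_only=False)  # UPP-random-style
print(restricted, unrestricted)
```

With the restriction on, fragments can never enter the backbone; with it off, the backbone alignment may contain fragments, which is the condition the UPP-random variants test.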

