Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.

Hong Y, Kang J, Lee D, van Rossum DB - PLoS ONE (2010)

Bottom Line: Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method, with similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.


Affiliation: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America.

ABSTRACT
A major computational challenge in the genomic era is assigning structural and functional annotations to the vast quantities of sequence information now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physico-chemical information, genomic information). Indeed, we have demonstrated that the Position-Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, these alignments remain informative even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (both embedded and unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method, with similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.
We show that sequence embedding could serve as one of the vehicles for measuring low-identity alignments and for incorporating them into high-performance PSSM-based alignment profiles.



Figure 6 (pone-0013596-g006): Example of embedded alignments. The seeded alignments for three consecutive chimera sequences. The query and the target sequences are general transcription factor II, i isoform from Homo sapiens (NP_001509.2) and the ML (MD-2-related lipid-recognition) domain (cd00912), respectively.

Mentions: Gapped extension (GE) in rps-BLAST starts at a GE starting pair, which is the central residue pair in the highest-scoring segment of any high-scoring segment pair (HSP) whose score is sufficiently high. Performing the gapped extension from different GE starting pairs can produce different alignments, and there is no guarantee that the same GE starting pair is selected for different chimera sequences. However, if a portion of a target sequence is conserved in a query sequence, then the conserved region is very likely to be aligned for multiple neighboring chimera sequences. We exploit this property to speed up the alignment process (Figure 6).
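
The selection of a GE starting pair described above can be illustrated with a minimal sketch. It is not the rps-BLAST or Adaptive GDDA-BLAST implementation: the function name `ge_starting_pair`, the identity-based `toy_score`, the window length, and the use of a plain target string in place of a PSSM are all assumptions made for illustration. The sketch slides a fixed-length window along an ungapped HSP, keeps the best-scoring window, and returns the residue pair at its centre.

```python
from typing import Callable, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring: +5 for identical residues, -4 otherwise."""
    return 5 if a == b else -4


def ge_starting_pair(query: str, target: str, q_start: int, t_start: int,
                     length: int, score: Callable[[str, str], int],
                     window: int = 11) -> Tuple[int, int]:
    """Return (query_index, target_index) at the centre of the
    highest-scoring window of an ungapped HSP.

    The window length is an assumed parameter, not a value from the paper.
    """
    if length <= window:
        # HSP shorter than the window: use the centre of the whole HSP.
        mid = length // 2
        return q_start + mid, t_start + mid

    # Score of each aligned residue pair along the ungapped HSP.
    pair_scores = [score(query[q_start + i], target[t_start + i])
                   for i in range(length)]

    # Sliding-window sum; the first best-scoring window wins ties.
    current = sum(pair_scores[:window])
    best, best_offset = current, 0
    for i in range(1, length - window + 1):
        current += pair_scores[i + window - 1] - pair_scores[i - 1]
        if current > best:
            best, best_offset = current, i

    mid = best_offset + window // 2
    return q_start + mid, t_start + mid


# Toy data: a conserved block "ACDEF" flanked by unrelated residues.
query = "KKKKKACDEFKKKKK"
target = "TTTTTACDEFTTTTT"
print(ge_starting_pair(query, target, 0, 0, 15, toy_score, window=5))

# A neighbouring sequence with the same conserved block shifted by two
# positions selects the same target position, shifted in the query.
shifted = "KK" + query
print(ge_starting_pair(shifted, target, 2, 0, 15, toy_score, window=5))
```

In this toy case both calls pick the same target residue, the centre of the conserved block, which mirrors the observation that a conserved region tends to be aligned across neighboring chimera sequences.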

