Limits...
Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.

Hong Y, Kang J, Lee D, van Rossum DB - PLoS ONE (2010)

Bottom Line: Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method.Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains.Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America.

ABSTRACT
A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physio-chemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences. We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.

Show MeSH
The Concept of GDDA-BLAST and Adaptive GDDA-BLAST.This schematic depicts the work flow of GDDA-BLAST and Adaptive GDDA-BLAST (i–ii) The algorithm begins with a modification of the query amino acid sequence via the insertion of a “seed” sequence from the profile of interest. These seeds are obtained from the profile consensus sequences from NCBI's Conserved Domain Database (CDD). GDDA-BLAST inserts a seed at every query amino acid position; in constrast, Adaptive GDDA-BLAST inserts a seed at the positions where the seed is likely to be extended to an alignment. (iii–iv) The signals are collected from the optimal alignments between the “embedded” sequences and profiles using rps-BLAST or Adaptive GDDA-BLAST; and, they are incorporated as a composite score into an N by M data matrix. (v–ix) This dataspace can be analyzed to generate phylograms and dendrograms based on the Euclidean distance and Pearson correlation measures on alignment profiles of query proteins, respectively.
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC2962639&req=5

pone-0013596-g001: The Concept of GDDA-BLAST and Adaptive GDDA-BLAST.This schematic depicts the work flow of GDDA-BLAST and Adaptive GDDA-BLAST (i–ii) The algorithm begins with a modification of the query amino acid sequence via the insertion of a “seed” sequence from the profile of interest. These seeds are obtained from the profile consensus sequences from NCBI's Conserved Domain Database (CDD). GDDA-BLAST inserts a seed at every query amino acid position; in constrast, Adaptive GDDA-BLAST inserts a seed at the positions where the seed is likely to be extended to an alignment. (iii–iv) The signals are collected from the optimal alignments between the “embedded” sequences and profiles using rps-BLAST or Adaptive GDDA-BLAST; and, they are incorporated as a composite score into an N by M data matrix. (v–ix) This dataspace can be analyzed to generate phylograms and dendrograms based on the Euclidean distance and Pearson correlation measures on alignment profiles of query proteins, respectively.

Mentions: A phylogenetic profile represents a protein as a vector where each entry quantifies the existence of the protein in a different genomes [17]–[19]. This approach has been proven applicable to whole molecules (Single Profile Method), to isolated domains (Multiple Profile Method) and to individual amino acids. Similar to phylogenetic profiles, our embedded-alignment profiles present a protein as a vector where each entry quantifies the existence of alignments with a PSSM [1], [2]. The basic idea underlying our method begins by compiling a set of PSSMs that the query sequence is compared to. These profiles are obtainable from any protein-sequence knowledge-base source (e.g., Protein Data Bank, Pfam, SMART, and NCBI Conserved Domain Database (CDD)) [20]–[23], or they are locally creatible through PSI-BLAST [24]. We employ the reverse specific position BLAST (rps-BLAST) [23] to compare the query with PSSMs. A single domain PSSM database is utilized for pairwise comparison. We then record and quantify all alignments between an unmodified (control) query sequence and a modified one. The latter is composed of two types of alignments: “seeded” and “non-seeded” alignments. We modify the query sequence with a “seed” from the PSSM, and, thereby, create a consistent initiation site (Figure 1-i, ii). The “seeds” are generated from the profiles by taking a portion (e.g., 10% in this study) of the PSSM sequence (e.g., from the N-terminus or C-terminus). These thresholds have been adopted herein from the results of our previous studies. These “seeds” are embedded at each position of the query sequence. For example, a query sequence of 100 amino acids yields 100 distinct test sequences for each “seed”. This strategy is designed to amplify and encode all the alignments possible for any given query sequence. In place of a sliding window, we utilized a sliding “seed,” which is similar, yet inverse to the embedding strategies employed by Henikoff and Henikoff [25].


Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding.

Hong Y, Kang J, Lee D, van Rossum DB - PLoS ONE (2010)

The Concept of GDDA-BLAST and Adaptive GDDA-BLAST.This schematic depicts the work flow of GDDA-BLAST and Adaptive GDDA-BLAST (i–ii) The algorithm begins with a modification of the query amino acid sequence via the insertion of a “seed” sequence from the profile of interest. These seeds are obtained from the profile consensus sequences from NCBI's Conserved Domain Database (CDD). GDDA-BLAST inserts a seed at every query amino acid position; in constrast, Adaptive GDDA-BLAST inserts a seed at the positions where the seed is likely to be extended to an alignment. (iii–iv) The signals are collected from the optimal alignments between the “embedded” sequences and profiles using rps-BLAST or Adaptive GDDA-BLAST; and, they are incorporated as a composite score into an N by M data matrix. (v–ix) This dataspace can be analyzed to generate phylograms and dendrograms based on the Euclidean distance and Pearson correlation measures on alignment profiles of query proteins, respectively.
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC2962639&req=5

pone-0013596-g001: The Concept of GDDA-BLAST and Adaptive GDDA-BLAST.This schematic depicts the work flow of GDDA-BLAST and Adaptive GDDA-BLAST (i–ii) The algorithm begins with a modification of the query amino acid sequence via the insertion of a “seed” sequence from the profile of interest. These seeds are obtained from the profile consensus sequences from NCBI's Conserved Domain Database (CDD). GDDA-BLAST inserts a seed at every query amino acid position; in constrast, Adaptive GDDA-BLAST inserts a seed at the positions where the seed is likely to be extended to an alignment. (iii–iv) The signals are collected from the optimal alignments between the “embedded” sequences and profiles using rps-BLAST or Adaptive GDDA-BLAST; and, they are incorporated as a composite score into an N by M data matrix. (v–ix) This dataspace can be analyzed to generate phylograms and dendrograms based on the Euclidean distance and Pearson correlation measures on alignment profiles of query proteins, respectively.
Mentions: A phylogenetic profile represents a protein as a vector where each entry quantifies the existence of the protein in a different genomes [17]–[19]. This approach has been proven applicable to whole molecules (Single Profile Method), to isolated domains (Multiple Profile Method) and to individual amino acids. Similar to phylogenetic profiles, our embedded-alignment profiles present a protein as a vector where each entry quantifies the existence of alignments with a PSSM [1], [2]. The basic idea underlying our method begins by compiling a set of PSSMs that the query sequence is compared to. These profiles are obtainable from any protein-sequence knowledge-base source (e.g., Protein Data Bank, Pfam, SMART, and NCBI Conserved Domain Database (CDD)) [20]–[23], or they are locally creatible through PSI-BLAST [24]. We employ the reverse specific position BLAST (rps-BLAST) [23] to compare the query with PSSMs. A single domain PSSM database is utilized for pairwise comparison. We then record and quantify all alignments between an unmodified (control) query sequence and a modified one. The latter is composed of two types of alignments: “seeded” and “non-seeded” alignments. We modify the query sequence with a “seed” from the PSSM, and, thereby, create a consistent initiation site (Figure 1-i, ii). The “seeds” are generated from the profiles by taking a portion (e.g., 10% in this study) of the PSSM sequence (e.g., from the N-terminus or C-terminus). These thresholds have been adopted herein from the results of our previous studies. These “seeds” are embedded at each position of the query sequence. For example, a query sequence of 100 amino acids yields 100 distinct test sequences for each “seed”. This strategy is designed to amplify and encode all the alignments possible for any given query sequence. In place of a sliding window, we utilized a sliding “seed,” which is similar, yet inverse to the embedding strategies employed by Henikoff and Henikoff [25].

Bottom Line: Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method.Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains.Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America.

ABSTRACT
A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physio-chemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences. We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.

Show MeSH