Identification of antifreeze proteins and their functional residues by support vector machine and genetic algorithms based on n-peptide compositions.

Yu CS, Lu CH - PLoS ONE (2011)

Bottom Line: For the first time, multiple sets of n-peptide compositions from antifreeze protein (AFP) sequences of various cold-adapted fish and insects were analyzed using support vector machine and genetic algorithms. Our results reveal that it is feasible to identify the shared sequential features among the various structural types of AFPs. This approach should be useful for genomic and proteomic studies involving cold-adapted organisms.


Affiliation: Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan. yucs@fcu.edu.tw

ABSTRACT
For the first time, multiple sets of n-peptide compositions from antifreeze protein (AFP) sequences of various cold-adapted fish and insects were analyzed using support vector machine and genetic algorithms. The identification of AFPs is difficult because they exist as evolutionarily divergent types, and because their sequences and structures are present in limited numbers in currently available databases. Our results reveal that it is feasible to identify the shared sequential features among the various structural types of AFPs. Moreover, we were able to identify residues involved in ice binding without requiring knowledge of the three-dimensional structures of these AFPs. This approach should be useful for genomic and proteomic studies involving cold-adapted organisms.
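As a rough illustration of the feature type named in the abstract, the sketch below computes an n-peptide composition for one sequence. It assumes that an n-peptide is a contiguous run of n residues and that composition means relative frequency; the function name `n_peptide_composition` and the example sequence are hypothetical, and the paper's actual feature definitions may differ.

```python
from collections import Counter


def n_peptide_composition(seq: str, n: int) -> dict[str, float]:
    """Relative frequency of each contiguous n-residue peptide in seq.

    Assumption: n-peptides are contiguous windows of length n. This is
    an illustrative definition, not necessarily the one used in the paper.
    """
    total = len(seq) - n + 1  # number of length-n windows in the sequence
    if n < 1 or total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {peptide: count / total for peptide, count in counts.items()}


# Hypothetical toy sequence, not taken from any dataset.
dipeptides = n_peptide_composition("AAT", 2)
```

Vectors of this kind, computed for several values of n, are the sort of input a support vector machine could be trained on; how the genetic algorithm combines or selects them is not specified in this excerpt.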

Figure 1. Sequence identity (SI) distribution for pairs of AFPs. The x-axis values are the best pairwise-matched SI values for each of the 369 AFP sequences of the second independent dataset against the other 368 sequences of that dataset. The y-axis values are the best pairwise-matched SI values for each of the 369 AFP sequences against the 44 sequences of the validation set. Black symbols mark AFPs in the independent dataset that were identified; red symbols mark those that were not.

Mentions: The AFPs of the second independent dataset represent a divergent group of organisms and were collected from the UniProtKB database [25], [26]; about 57% of these proteins were identified as AFPs by the SVMGA. The sequence identity (SI) pair distribution, which characterizes the relative number of sequence pairs within each percentage SI interval, was used to examine the effect of sequence homology on AFP identification. Each of the 369 AFP sequences was used as a query sequence to profile the SI pair distribution. Each query sequence was aligned with the 44 AFPs of the validation set and also with the other 368 sequences of the second independent dataset. The largest SI value for each query aligned with the 44 AFPs was plotted along the y axis, and the largest SI value for the same query aligned with the other 368 sequences of the second dataset was plotted along the x axis (Figure 1). The SI values associated with AFPs in the independent dataset that were not identified by the SVMGA are shown in red; most of these values are <20%, below the so-called midnight-zone threshold for detection of structural/functional relationships [34]. Because the dataset that contained the 369 AFPs was biased (it only contained AFPs from well-characterized cold-adapted organisms), many of the data points are located at the far end of the x axis.
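The procedure for building the points of Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the alignment is a plain Needleman-Wunsch global alignment with hypothetical unit scores (the excerpt does not say which alignment tool or scoring scheme was used), SI is taken as identical columns divided by alignment length, and the names `global_identity` and `best_si_points` are invented for this example.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Percent identity from a simple Needleman-Wunsch global alignment.

    Assumption: unit scores and identity = identical columns / alignment
    length. The paper's actual alignment settings are not given here.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns and length.
    i, j, same, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * same / length if length else 0.0


def best_si_points(queries: list[str],
                   validation: list[str]) -> list[tuple[float, float]]:
    """For each query, return (x, y): x is the best SI against the other
    queries, y is the best SI against the validation sequences."""
    points = []
    for k, query in enumerate(queries):
        others = queries[:k] + queries[k + 1:]
        x = max((global_identity(query, s) for s in others), default=0.0)
        y = max((global_identity(query, s) for s in validation), default=0.0)
        points.append((x, y))
    return points


# Hypothetical toy sequences standing in for the 369 queries and the
# 44 validation AFPs.
points = best_si_points(["ACDE", "ACDE", "WWWW"], ["ACDF"])
```

With real data, each query would yield one point; coloring the points by whether the classifier identified the query would reproduce the layout described above, and points with y below 20 would fall under the midnight-zone threshold the text mentions.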

