Comparison study on k-word statistical measures for protein: from sequence to 'sequence space'.

Dai Q, Wang T - BMC Bioinformatics (2008)

Bottom Line: This paper proposed a scheme for constructing a protein 'sequence space' from the protein sequences related to a given protein, and compared the performance of statistical measures with and without the information in that 'sequence space'. Based on the experiments, several conclusions were drawn and, from them, guidelines for the use of protein 'sequence space' and statistical measures were obtained. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis.


Affiliation: Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, PR China. daiailiu2004@yahoo.com.cn

ABSTRACT

Background: Many statistical measures have been proposed that compare protein sequences efficiently in order to infer protein structure, function and evolutionary information. They share the idea of using the k-word frequencies of protein sequences. Until now, however, the information in the sequences related to a given protein has not been used for protein sequence comparison. This paper proposed a scheme for constructing a protein 'sequence space' from the protein sequences related to a given protein, and compared the performance of statistical measures with and without the information in that 'sequence space'. This paper also presented two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure).
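The k-word idea in the Background can be illustrated with a minimal Python sketch. It counts overlapping k-words in a sequence, normalizes them to frequencies, and compares two frequency vectors with a cosine similarity and a Euclidean distance. The function names are hypothetical, and these are generic textbook forms; they are not claimed to match the exact definitions of cos.k or eu.k used in the paper, and they do not implement 'sequence space', gre.k or gsm.k.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq, k):
    """Return relative frequencies of the overlapping k-words in seq."""
    n_words = len(seq) - k + 1
    if n_words <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n_words))
    return {word: count / n_words for word, count in counts.items()}


def cosine_similarity(f, g):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * g.get(word, 0.0) for word, value in f.items())
    norm_f = sqrt(sum(v * v for v in f.values()))
    norm_g = sqrt(sum(v * v for v in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        return 0.0
    return dot / (norm_f * norm_g)


def euclidean_distance(f, g):
    """Euclidean distance between two sparse frequency vectors."""
    words = set(f) | set(g)
    return sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    # Toy sequences, chosen only for illustration.
    a = kword_frequencies("MKVLAAGIVGL", 2)
    b = kword_frequencies("MKVLSAGIVAL", 2)
    print(cosine_similarity(a, b), euclidean_distance(a, b))
```

Storing the frequencies as dictionaries keeps the vectors sparse, which matters because the number of possible k-words over 20 amino acids grows as 20^k.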

Results: We tested the statistical measures, with and without protein 'sequence space', on three data sets. This not only offers a systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparisons of statistical measures based on protein sequences alone. Moreover, we compared our statistical measures with alignment-based measures and with existing statistical measures. The experiments were grouped into two sets. The first, performed via ROC (Receiver Operating Characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. Based on the experiments, several conclusions were drawn and, from them, guidelines for the use of protein 'sequence space' and statistical measures were obtained.
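The ROC analysis mentioned in the Results summarizes how well a score separates two classes through the area under the ROC curve (AUC). The sketch below computes AUC with the standard pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive item scores higher, with ties counted as one half. The function name, the toy scores and the labeling of sequence pairs as related or unrelated are assumptions for illustration; the abstract does not describe the paper's exact evaluation protocol.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for similarity scores.

    scores: higher means "more similar"; for a dissimilarity measure,
            pass the negated distances instead.
    labels: truthy for related pairs, falsy for unrelated pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical similarity scores for four sequence pairs.
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 1, 0, 0]
    print(roc_auc(scores, labels))
```

An AUC of 1.0 means every related pair scores above every unrelated pair, and 0.5 corresponds to a score that carries no class information. The double loop is quadratic in the number of pairs; a rank-based implementation would be preferable for large data sets.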

Conclusion: Alignment-based measures have a clear advantage when the data are highly redundant. Among the statistical measures, the most efficient is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, our proposed gre.k performs better, whereas all the other measures perform poorly on classification tasks. Almost all the statistical measures improve by using the information in 'sequence space' as the word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that using the information in 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.


Figure 7: MAUC values for data sets CK, RS and SP. The MAUC values are shown for the data sets CK, RS and SP, one for each data set. All the statistical measures are based on the k-word frequencies of the protein 'sequence space', which is built with ten score matrices.

Mentions: where AUC(measure, score matrix, k) denotes the area under the ROC curve of the statistical measure based on the k-word frequencies of the protein 'sequence space' built with the given score matrix. The MAUC values of all the statistical measures, based on ten score matrices and three data sets, are presented in Figure 7. Figure 7 largely confirms that the measures perform differently with different score matrices. The changes in DAUC for the data sets CK, RS and SP are similar. Among the BLOSUM score matrices, BLOSUM40 and BLOSUM100 do more to improve the classification ability of the statistical measures. Among the PAM score matrices, PAM120 or PAM250 gives the more obvious improvement in the classification ability of all the (dis)similarity measures on the highly redundant data, except for eu.k and gre.k, whereas PAM40 or PAM80 gives the more obvious improvement on the less redundant data.

