Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements.

Creanza TM, Horner DS, D'Addabbo A, Maglietta R, Mignone F, Ancona N, Pesole G - BMC Bioinformatics (2009)

Bottom Line: We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. The prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05). In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely because such features capture deviations from strictly neutral evolution that are expected as a consequence of the characteristics of the genetic code.


Affiliation: Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, Bari, Italy. creanza@ba.issia.cnr.it

ABSTRACT

Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed, conditional on the compared species and on the sequence lengths.

Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among human, mouse and rat. The association study showed that the analyzed features are statistically different in the two classes. To study the influence of sequence length on feature performance, a predictive study was performed on different data sets, each composed of equal numbers of coding and non-coding alignments of equal length, with ascending average length. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained on comparative features in general outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05).

Conclusion: We observed distinct and consistent patterns for individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences - this is likely related to the fact that such features capture deviations from strictly neutral evolution expected as a consequence of the characteristics of the genetic code.


Figure 4: The summary plot. The error bars in the figure represent the median and the 25% and 75% quantiles of the error rates for H. sapiens sequences and for their pairwise alignments; the red, blue, yellow and green bars refer to the four classes of ascending sequence length given in the legend.
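A summary of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the four length classes used for the plot (41-90, 91-140, 141-190 and 191-250 bp), treats the upper bound of each class as inclusive (an assumption), and uses hypothetical names (`LENGTH_CLASSES`, `length_class`, `summarize`) and made-up error rates.

```python
from statistics import quantiles

# Length classes in bp; inclusive upper bounds are an assumption of this sketch.
LENGTH_CLASSES = [(41, 90), (91, 140), (141, 190), (191, 250)]


def length_class(length_bp):
    """Return the (low, high) class containing length_bp, or None if outside 41-250 bp."""
    for low, high in LENGTH_CLASSES:
        if low <= length_bp <= high:
            return (low, high)
    return None


def summarize(records):
    """Group (length_bp, error_rate) pairs by class; return {class: (q25, median, q75)}."""
    grouped = {}
    for length_bp, error_rate in records:
        cls = length_class(length_bp)
        if cls is not None:
            grouped.setdefault(cls, []).append(error_rate)
    summary = {}
    for cls, rates in grouped.items():
        if len(rates) >= 2:  # quantiles() needs at least two data points
            q25, median, q75 = quantiles(rates, n=4, method="inclusive")
            summary[cls] = (q25, median, q75)
    return summary


# Made-up error rates for illustration only (not values from the paper).
example = [(50, 0.30), (70, 0.26), (85, 0.22), (100, 0.18), (130, 0.14),
           (200, 0.08), (240, 0.06)]
for cls, stats in summarize(example).items():
    print(cls, stats)
```

Classes with fewer than two error rates are skipped here, since quartiles are not defined for a single value.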

Mentions: To compare the performance of all classifiers as a function of sequence length, we summarized the results for human sequences and their alignments in a single plot, shown in Figure 4. To build this summary plot, we grouped the empirically estimated error rates into four classes by averaging the error rates of the classifiers over increasing sequence lengths: [41,90[, [91,140[, [141,190[, [191,250] bp. Our analysis suggests that the percentage of stop codons %Stopm, the rate ratio SRRm and its spread SRRstd are the most discriminatory features for the species considered. Nevertheless, %Stopm is strongly influenced by sequence length, while SRRstd and SRRm exhibit error rates below 30% even for sequences and alignments with lengths in the [41,90[ bp range. Finally, we point out the behavior of the classifier trained on Blosum scores, BL80M, which provides small error rates independently of sequence length. Lower performance is expected for all methods when confronted with short sequences because of stochastic factors; the poor performance of the %Stopm metric on short sequences is unsurprising, given that few stop codons are expected even in non-coding sequences when the length is short. To obtain higher accuracies, for each sequence length we trained one classifier with the comparative features and one with the intrinsic ones.
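The remark about stop codons in short sequences can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only, not the paper's %Stopm feature: it assumes independent, equiprobable nucleotides and a single reading frame, and the function name `p_no_stop` is hypothetical. Under those assumptions, 3 of the 64 codons in the standard genetic code are stops, so a frame of n complete codons contains no stop with probability (61/64)^n.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
N_CODONS = 64                        # 4 ** 3 possible triplets


def p_no_stop(length_bp):
    """Probability that one reading frame of a random sequence has no stop codon.

    Assumes independent, equiprobable nucleotides (a simplification made
    for this sketch, not a model taken from the paper).
    """
    n_codons = length_bp // 3  # complete codons in the frame
    p_sense = (N_CODONS - len(STOP_CODONS)) / N_CODONS  # 61/64
    return p_sense ** n_codons


# Boundaries of the four length classes discussed above.
for length_bp in (41, 90, 140, 190, 250):
    print(f"{length_bp:>3} bp: P(no stop in frame) = {p_no_stop(length_bp):.3f}")
```

The probability of a stop-free frame falls geometrically with length, which is consistent with the observation that a stop-codon feature separates the two classes poorly when sequences are short.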

