Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements.

Creanza TM, Horner DS, D'Addabbo A, Maglietta R, Mignone F, Ancona N, Pesole G - BMC Bioinformatics (2009)

Bottom Line: We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05). In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely because such features capture deviations from strictly neutral evolution expected as a consequence of the characteristics of the genetic code.


Affiliation: Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, Bari, Italy. creanza@ba.issia.cnr.it

ABSTRACT

Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding from non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should consist of an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed, conditional on the compared species and on the sequence lengths.

Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among human, mouse and rat. The association study showed that the analyzed features are statistically different in the two classes. To study the influence of sequence length on feature performance, a predictive study was performed on different data sets composed of equal numbers of coding and non-coding alignments of equal length, with ascending average length across data sets. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained on comparative features in general outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05).
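The significance check mentioned in the Results can be illustrated with a two-sample Kolmogorov-Smirnov comparison of two sets of error rates (for example, with and without intrinsic features). This is a minimal sketch and not the authors' procedure: the function names, the choice of inputs, and the use of the large-sample critical value in place of an exact P-value are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for v in xs + ys:
        fx = bisect_right(xs, v) / n  # empirical CDF of x at v
        fy = bisect_right(ys, v) / m  # empirical CDF of y at v
        d = max(d, abs(fx - fy))
    return d


def ks_rejects(d, n, m, alpha=0.05):
    """Large-sample test: reject equality of distributions if D exceeds the critical value.

    Assumption: the asymptotic threshold c(alpha) * sqrt((n + m) / (n * m)) with
    c(alpha) = sqrt(-ln(alpha / 2) / 2) is used; it is only approximate for small samples.
    """
    c_alpha = math.sqrt(-math.log(alpha / 2.0) / 2.0)
    return d > c_alpha * math.sqrt((n + m) / (n * m))
```

A call such as `ks_rejects(ks_statistic(errors_a, errors_b), len(errors_a), len(errors_b))` would then flag whether two hypothetical lists of error rates differ at the 0.05 level.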

Conclusion: We observed distinct and consistent patterns for individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences - this is likely related to the fact that such features capture deviations from strictly neutral evolution expected as a consequence of the characteristics of the genetic code.


Figure 2: Learning curves for the intrinsic features. The plots show the error rate as a function of sequence length (in bp) for H. sapiens and for each intrinsic feature.

Mentions: We analyzed the performance of classifiers trained on rat-human-mouse alignments as a function of sequence length. The study of this trend indicates the minimum sequence length required to obtain a significantly small error rate. This analysis is especially useful for methods that identify coding and non-coding CSEs using classifiers built on sliding windows of fixed length. Accordingly, we studied how the performance of the classifiers changes with sequence length for fixed data set sizes: in particular, we controlled the predictive error rates by monitoring their P-values and the powers of the classifiers (see "Methods: Predictive study"). To this end, we built 21 different balanced data sets, that is, with equal numbers of coding and non-coding alignments (75 coding and 75 non-coding sequences) and with sequences of similar length, with ascending average sequence length across data sets. The sequence lengths of the 21 data sets were respectively in the following ranges: [41, 50], [51, 60], ..., [241, 250] bp. For each classifier and for each species, we constructed learning curves in which the empirical error rate (median, 25% and 75% quantiles) is plotted as a function of sequence length. As the trends of the learning curves for the non-comparative features were very similar for all three species, we reported in Figure 2 only the results for human sequences. Concerning the comparative features, we observed that the learning curves for the human versus rat comparison are very similar to those for human versus mouse. Accordingly, Figure 3 shows only the human/mouse and mouse/rat curves. Several features show significantly different distributions for coding and non-coding sequences/alignments but are not able to discriminate accurately between the two classes of sequences (P-values > 0.05). These features include CGskew, ATskew, , %Stopstd and NPI.
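The data set construction described above (21 length ranges, 75 sequences per class, error rates summarized by median and quartiles) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the function names, the representation of sequences as plain strings, the random sampling step, and the quantile method of Python's `statistics` module are all hypothetical choices.

```python
import random
from statistics import quantiles


def length_bins(start=41, stop=250, width=10):
    """Inclusive length ranges (41, 50), (51, 60), ..., (241, 250): 21 bins in total."""
    return [(lo, lo + width - 1) for lo in range(start, stop, width)]


def balanced_sample(coding, noncoding, low, high, n_per_class=75, rng=None):
    """Draw n_per_class sequences from each class with length in [low, high].

    Assumption: sequences are plain strings and are sampled uniformly at random.
    """
    rng = rng or random.Random(0)
    in_bin_c = [s for s in coding if low <= len(s) <= high]
    in_bin_n = [s for s in noncoding if low <= len(s) <= high]
    if len(in_bin_c) < n_per_class or len(in_bin_n) < n_per_class:
        raise ValueError(f"not enough sequences in range [{low}, {high}]")
    return rng.sample(in_bin_c, n_per_class), rng.sample(in_bin_n, n_per_class)


def curve_point(error_rates):
    """Summarize repeated error estimates for one bin as (25% quantile, median, 75% quantile)."""
    q1, q2, q3 = quantiles(error_rates, n=4)
    return q1, q2, q3
```

Plotting `curve_point` for each entry of `length_bins()` against the bin's length range would give one learning curve per classifier; how the repeated error estimates are obtained is left open here.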

