Exploration of multivariate analysis in microbial coding sequence modeling.

Mehmood T, Bohlin J, Kristoffersen AB, Sæbø S, Warringer J, Snipen L - BMC Bioinformatics (2012)

Affiliation: Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Aas, Norway. tahir.mehmood@umb.no

ABSTRACT

Background: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.

Results: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).

Conclusions: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
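The three sequence representations compared in the abstract (DNA, codon and protein) can be illustrated with a minimal sketch that turns one ORF into frequency vectors. This is an illustration under assumptions, not the authors' feature construction: the function names `codons`, `frequencies` and `representations` are hypothetical, the standard genetic code is assumed for translation, and simple first-order frequencies are used because this section does not state the exact features fed to CPPLS or IMM.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the source does not say which translation table the authors used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons(orf):
    """Split an ORF into non-overlapping triplets; a trailing partial codon is dropped."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def frequencies(symbols, alphabet):
    """Relative frequency of every alphabet symbol; symbols outside the alphabet are ignored."""
    counts = Counter(symbols)
    total = sum(counts[s] for s in alphabet)
    return {s: (counts[s] / total if total else 0.0) for s in alphabet}


def representations(orf):
    """Return DNA, codon and protein frequency vectors for one ORF (hypothetical helper)."""
    triplets = codons(orf)
    protein = [CODON_TABLE[c] for c in triplets if c in CODON_TABLE]
    return {
        "dna": frequencies(orf.upper(), BASES),
        "codon": frequencies(triplets, sorted(CODON_TABLE)),
        "protein": frequencies(protein, sorted(set(AMINO))),
    }
```

Calling `representations("ATGAAATAA")` gives a 4-element DNA vector, a 64-element codon vector and a 21-element protein vector (20 amino acids plus the stop symbol), which shows how the codon representation keeps reading-frame information that the single-base DNA frequencies lose.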

Figure 2: Performance on test data. The box-and-whisker plots show the distributions of performance (% correctly classified) on test data for each species, using IMM (upper panels) or CPPLS (lower panels) on ORFs represented as codon, protein or DNA sequences. The dotted red line indicates the maximum possible performance (100%). For most species, the performance of CPPLS on codon sequences is 100%.

Mentions: In Figure 2 we show the distributions of performance for each species obtained by applying both the IMM and CPPLS methods to codon, protein and DNA sequences. The difference between the IMM method (upper panels) and the CPPLS method (lower panels) is the most striking result. The codon representation (leftmost panels) appears to be better than protein and DNA, especially for the IMM approach. We observe non-constant variance of performance across levels; for instance, an F-test on the original performance measure indicates that the variation observed using CPPLS with codon representation was significantly smaller than the corresponding variance for IMM (p < 0.001). To make a more formal test, we used a mixed interaction effect ANOVA-type model (see Methods) on transformed performance, with results presented in Table 3. The analysis indicates significant variation among levels of methods (p < 0.001), sequences (p < 0.001) and the method-sequence interaction (p < 0.001). A Tukey test [35], with p-values adjusted for multiple comparisons, was carried out to compare the differences in mean (transformed) performance between methods and between sequence representations. We found that CPPLS performed, in general, better than IMM (p < 0.001), while codons were a better sequence representation than both protein and DNA (p < 0.001). No difference was found between the latter two sequence representations. Further, testing for method and sequence interaction, we found that CPPLS with codon representation performed significantly better than IMM with protein (p < 0.001) and with DNA (p < 0.001) representations. Mean performance of IMM with codon representation was similar to that of CPPLS with codon representation, but the variation of results was significantly lower for CPPLS (p < 0.001), indicating superior performance.
The estimated standard deviation of transformed performance due to the random effect of species was larger than the general error term, which indicates that performance varies considerably between species (Table 3). In general, the average performance of both the IMM and CPPLS algorithms is very good. Even the worst combination, IMM on DNA data, classifies more than 95% of sequences correctly (both Positives and Negatives) in the majority of the performed tests. Thus, both the IMM and CPPLS methods support the notion that the sequences within the Positive set, and within the Negative set, have base compositions that are intrinsically more similar to each other and, therefore, that our division of sequences into these two categories is meaningful. The high performance is largely an effect of our strict choice of threshold t when selecting Positives. We only included the highly conserved genes as Positives, and it is quite likely that these genes have more in common than less conserved genes. We also tried more lenient thresholds, which gave larger and more heterogeneous sets of Positives (and Negatives) and resulted in a drop in overall performance. However, the differences between methods and sequence representations found for the subset with t = 0.3 hold throughout.
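The variance comparison mentioned above (an F-test on the spread of CPPLS versus IMM performance) rests on a ratio of two sample variances. The sketch below computes only that ratio and its degrees of freedom with the Python standard library. It is an illustration, not the authors' analysis code: the function name `variance_ratio` is hypothetical, and the source does not say how the samples were grouped or which software was used.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return (F, df_a, df_b), where F = var(sample_a) / var(sample_b).

    Both variances are unbiased sample variances (n - 1 denominator).
    """
    if len(sample_a) < 2 or len(sample_b) < 2:
        raise ValueError("each sample needs at least two observations")
    var_a = variance(sample_a)
    var_b = variance(sample_b)
    if var_b == 0:
        raise ZeroDivisionError("denominator sample has zero variance")
    return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
```

With `sample_a` holding IMM performances and `sample_b` holding CPPLS performances, a ratio well above 1 points to a larger spread for IMM. The p-value is the upper-tail probability of the F distribution with the returned degrees of freedom, which can be obtained for instance with `scipy.stats.f.sf(F, df_a, df_b)` if SciPy is available. The F-test assumes approximately normal data, so it is a rough guide for percentages that pile up near 100%.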

