Exploration of multivariate analysis in microbial coding sequence modeling.

Mehmood T, Bohlin J, Kristoffersen AB, Sæbø S, Warringer J, Snipen L - BMC Bioinformatics (2012)

Affiliation: Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Aas, Norway. tahir.mehmood@umb.no

ABSTRACT

Background: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.

Results: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).

Conclusions: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.

Figure 3: Sensitivity and specificity. The distributions of sensitivity and specificity for each species, using IMM and CPPLS on codon sequences only. Sensitivity is defined as the ability to detect Positives and specificity as the ability to detect Negatives; both are given as percentages.
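
The two quantities defined in the caption can be written out as a short Python sketch. The function name and the boolean encoding of the class labels (True for Positives, False for Negatives) are assumptions made for illustration, not something taken from the article.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) as percentages.

    y_true, y_pred: equal-length sequences of booleans,
    where True means Positive (coding) and False means Negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # Positives detected
    fn = sum(1 for t, p in pairs if t and not p)      # Positives missed
    tn = sum(1 for t, p in pairs if not t and not p)  # Negatives detected
    fp = sum(1 for t, p in pairs if not t and p)      # Negatives missed
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one Positive and one Negative")
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity
```

Both values are computed within one class, so neither is affected by how many more Negatives than Positives the test set contains.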

It should also be noted that the archaeon Sulfolobus islandicus gives a notable drop in performance for the IMM, but less so for the CPPLS. This is possibly explained by a difference in variance between the sets of Positives and Negatives. We expect Positive sequences (coding genes) to be more homogeneous than Negatives (non-coding ORFs). In any genome, the number of non-coding ORFs is many orders of magnitude larger than the number of coding genes, and since these non-coding ORFs are regarded as Negatives, the variance in this set is considerably larger than in the Positives set. It is therefore reasonable to expect this difference in homogeneity between the Positives and Negatives. When fitting Markov chain models to the Positives and the Negatives, we end up describing the 'average' of each class without taking the difference between their variances into account. Hence, for IMM, information about within-class heterogeneity and class size is lost. For CPPLS, the regression coefficient estimates are affected by both the average and the variance in word frequencies, as well as by the number of sequences within each class.

To illustrate this effect, sensitivity (the ability to identify Positives) and specificity (the ability to identify Negatives) were computed for both methods using codon frequencies (Figure 3). Sensitivity is on average the same for both methods, but CPPLS exhibited a stronger ability to identify Negatives. To further understand why a multivariate approach like CPPLS outperforms IMM, we have focused on the results for Sulfolobus islandicus with codon representation. Figure 4 presents the density of the IMM scores and CPPLS scores. For each test sequence, the IMM score is computed as the difference between the Positive log-probability and the Negative log-probability, and the CPPLS scores are simply the fitted values.

It is clear from Figure 4 that the area of overlap between the red and blue densities is larger for IMM (upper panel) than for CPPLS (lower panel); in particular, the Negatives (blue curves) stretch into the Positive side, producing false positives. Another point is that a multivariate approach makes simultaneous use of all the available frequencies and their covariance structure. By taking this into consideration, multivariate analysis can identify important frequency effects and detect contributions from frequencies that are too small to be detected by the univariate Markov chain models. CPPLS will therefore provide superior statistical power compared to the Markov chain models, as long as a model selection procedure that prevents under- or over-fitting is implemented.
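
The score described above for the Markov chain approach, the Positive log-probability minus the Negative log-probability, can be illustrated with a simplified Python sketch. It assumes a fixed first-order Markov chain over codons with add-one smoothing and ignores the probability of the first codon; it is not the interpolated model used in the article. The class and function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math
from collections import defaultdict

# All 64 codons; used only to size the smoothing denominator.
CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codons(seq):
    """Split a DNA string into non-overlapping codons, dropping any partial tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

class FirstOrderChain:
    """Fixed first-order Markov chain (a simplification, not an IMM)."""

    def __init__(self, alphabet=CODONS, pseudocount=1.0):
        self.k = len(alphabet)
        self.pseudocount = pseudocount  # assumed add-one smoothing
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        """Count codon-to-codon transitions over lists of codons."""
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.counts[prev][cur] += 1.0
        return self

    def log_prob(self, seq):
        """Sum of smoothed log transition probabilities along seq."""
        total = 0.0
        for prev, cur in zip(seq, seq[1:]):
            row = self.counts.get(prev, {})
            num = row.get(cur, 0.0) + self.pseudocount
            den = sum(row.values()) + self.pseudocount * self.k
            total += math.log(num / den)
        return total

def log_odds_score(seq, pos_model, neg_model):
    """Positive log-probability minus Negative log-probability."""
    return pos_model.log_prob(seq) - neg_model.log_prob(seq)

# Toy training data, purely illustrative.
pos_model = FirstOrderChain().fit([codons("ATGGCTGCTGCTGCTGCT")])
neg_model = FirstOrderChain().fit([codons("TTTAAATTTAAATTTAAA")])
score = log_odds_score(codons("GCTGCTGCT"), pos_model, neg_model)
```

A sequence is called Positive when its score exceeds a threshold such as zero. Each model summarises its class only through transition frequencies pooled over all training sequences, which is the sense in which the text says that within-class variation is not taken into account.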

