Exploration of multivariate analysis in microbial coding sequence modeling.

Mehmood T, Bohlin J, Kristoffersen AB, Sæbø S, Warringer J, Snipen L - BMC Bioinformatics (2012)

Bottom Line: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).


Affiliation: Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Aas, Norway. tahir.mehmood@umb.no

ABSTRACT

Background: Gene finding is a complicated procedure that encompasses algorithms for coding sequence modeling, identification of promoter regions, handling of overlapping genes, and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov Model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.

Results: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).

Conclusions: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.



Figure 1: Visualization of undirected graph. The clusters of highly conserved ORFs are presented, based on a very small subset taken from Acinetobacter baumannii. Nodes represent genes, and identical color means genes from the same genome. The numbers are just identifiers within each genome. First we discard the clusters having fewer than 3 genes, i.e. red:6-yellow:7. Next, the medoid gene from each remaining cluster forms the set of Positives.

Mentions: where d(i,j) always gives a value between 0 and 1. Next, all ORFs were represented as nodes in an undirected graph, with an edge added between two nodes if the corresponding distance d(i,j) was less than or equal to a threshold t that designates sequence similarity. Hence, we considered two ORFs to be connected if they were sufficiently similar according to the specified threshold t. If multiple ORFs fulfill this similarity criterion, a graph consisting of many nodes (ORFs) will form, and such a graph will contain clusters of connected nodes. Clusters with nodes designating ORFs from the genomes of multiple strains are more likely to be real coding genes, since they are conserved across several genomes. A highly conserved ORF (HCO) is therefore represented as a graph with nodes from the genomes of all respective strains within a species. For each HCO cluster, the node (ORF) with the smallest sum of distances, as measured using the weighted edges to all other nodes (ORFs) in the same cluster, was extracted. Such nodes are referred to as medoids. The medoid thus represents the whole HCO cluster. The same procedure is subsequently applied repeatedly, generating a list of HCOs for each species. The list of HCOs for each species contains our candidate genes, and we designate that set as Positives. For illustration purposes, Figure 1 shows a visualized graph for a very small data subset taken from Acinetobacter baumannii.
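
The clustering and medoid steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy distance matrix DIST and the threshold value 0.2 are hypothetical, and the sketch assumes that the "sum of distances" is taken over the pairwise d(i,j) values within a cluster. Only the size filter from the Figure 1 caption (discard clusters with fewer than 3 genes) is applied; the requirement that a cluster contain ORFs from all strains is left out because it would need genome labels.

```python
from collections import deque

def connected_clusters(dist, t):
    """Connected components of the graph with an edge wherever dist[i][j] <= t."""
    n = len(dist)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:  # breadth-first search from `start`
            i = queue.popleft()
            component.append(i)
            for j in range(n):
                if j != i and not seen[j] and dist[i][j] <= t:
                    seen[j] = True
                    queue.append(j)
        clusters.append(sorted(component))
    return clusters

def medoid(cluster, dist):
    """Node with the smallest summed distance to the other nodes in the cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster if j != i))

def select_positives(dist, t, min_size=3):
    """Drop clusters smaller than min_size, then keep one medoid per cluster."""
    kept = [c for c in connected_clusters(dist, t) if len(c) >= min_size]
    return [medoid(c, dist) for c in kept]

# Hypothetical toy data: ORFs 0-2 are mutually similar, ORFs 3-4 form a pair.
DIST = [
    [0.0, 0.1, 0.3, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.3, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.9, 0.1, 0.0],
]

if __name__ == "__main__":
    print(connected_clusters(DIST, 0.2))  # two clusters: [0, 1, 2] and [3, 4]
    print(select_positives(DIST, 0.2))    # the pair is discarded; ORF 1 is the medoid
```

In this toy example ORFs 0 and 2 are not directly connected (0.3 > 0.2) but still fall into one cluster through ORF 1, which illustrates why clusters are connected components rather than groups of mutually similar ORFs.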

