Limits...
A gene pattern mining algorithm using interchangeable gene sets for prokaryotes.

Hu M, Choi K, Su W, Kim S, Yang J - BMC Bioinformatics (2008)

Bottom Line: These algorithms use the optimization problem formulation which is solved using the dynamic programming technique.In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information.The discovered gene patterns can be used for the detecting of ortholog and genes that collaborate for a common biological function.

View Article: PubMed Central - HTML - PubMed

Affiliation: EECS, Case Western Reserve University, Cleveland, OH 44106 USA. meng.hu@case.edu

ABSTRACT

Background: Mining gene patterns that are common to multiple genomes is an important biological problem, which can lead us to novel biological insights. When family classification of genes is available, this problem is similar to the pattern mining problem in the data mining community. However, when family classification information is not available, mining gene patterns is a challenging problem. There are several well developed algorithms for predicting gene patterns in a pair of genomes, such as FISH and DAGchainer. These algorithms use the optimization problem formulation which is solved using the dynamic programming technique. Unfortunately, extending these algorithms to multiple genome cases is not trivial due to the rapid increase in time and space complexity.

Results: In this paper, we propose a novel algorithm for mining gene patterns in more than two prokaryote genomes using interchangeable sets. The basic idea is to extend the pattern mining technique from the data mining community to handle the situation where family classification information is not available using interchangeable sets. In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information. To examine the effectiveness of gene patterns further, we propose an ortholog prediction method based on our gene pattern mining algorithm and compare our method to the bi-directional best hit (BBH) technique in terms of COG orthologous gene classification information. The experiment show that our algorithm achieves a 3% increase in recall compared to BBH without sacrificing the precision of ortholog detection.

Conclusion: The discovered gene patterns can be used for the detecting of ortholog and genes that collaborate for a common biological function.

Show MeSH
Accuracy of DISPattern w.r.t. Top-k Ortholog Candidates. When assigning a gene, we use the k most common Orthologs as candidates. This figure shows the accuracy when a different number of ortholog candidates are used.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2279103&req=5

Figure 3: Accuracy of DISPattern w.r.t. Top-k Ortholog Candidates. When assigning a gene, we use the k most common Orthologs as candidates. This figure shows the accuracy when a different number of ortholog candidates are used.

Mentions: To assign a gene to its candidate orthologous groups, we choose the top-K most common orthologous groups of this gene's hits on other genomes. When K is set to 1, it is essentially the same as BBH. When K is larger, more candidate groups will be assigned to each gene. The effect of the K value on the precision and recall of the ortholog discovery method is shown in figure 3. The recall/precision increases when K changes from 1 to 4. However, the recall and precision will decrease slightly when K becomes larger than 4. The reason for this is that assigning each gene to too many candidate orthologous groups introduces too much noise. Even when the pruning step is conducted later by applying the discovered patterns, the errors introduced by the noise will not be fully removed. In this experiment, the BHH is used as the baseline model. Since the BHH always assigns a gene to the COG group with best hits, the recall of BHH stays constant. (BHH also has the same recall and precision since it assigns a gene to only one COG.) The benefit of our method comes from the fact that the gene pattern can be used as a tool in finding the correct orthologous group if a gene can be grouped into multiple groups, i.e., the gene has high similar with genes in multiple groups. When K = 1, each gene is assigned to one orthologous group. Thus, there is no benefit of our method. However, with a larger K, i.e., a gene may be associated with multiple groups, our method can prune out the incorrect group assignment. For instance, if gene g has higher similarity with genes in group A than genes in group B, the BBH will assign A to g. However, g's true annotation may be B. In this case, the gene pattern may be able to correctly assign g to B based on the location of gene g.


A gene pattern mining algorithm using interchangeable gene sets for prokaryotes.

Hu M, Choi K, Su W, Kim S, Yang J - BMC Bioinformatics (2008)

Accuracy of DISPattern w.r.t. Top-k Ortholog Candidates. When assigning a gene, we use the k most common Orthologs as candidates. This figure shows the accuracy when a different number of ortholog candidates are used.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2279103&req=5

Figure 3: Accuracy of DISPattern w.r.t. Top-k Ortholog Candidates. When assigning a gene, we use the k most common Orthologs as candidates. This figure shows the accuracy when a different number of ortholog candidates are used.
Mentions: To assign a gene to its candidate orthologous groups, we choose the top-K most common orthologous groups of this gene's hits on other genomes. When K is set to 1, it is essentially the same as BBH. When K is larger, more candidate groups will be assigned to each gene. The effect of the K value on the precision and recall of the ortholog discovery method is shown in figure 3. The recall/precision increases when K changes from 1 to 4. However, the recall and precision will decrease slightly when K becomes larger than 4. The reason for this is that assigning each gene to too many candidate orthologous groups introduces too much noise. Even when the pruning step is conducted later by applying the discovered patterns, the errors introduced by the noise will not be fully removed. In this experiment, the BHH is used as the baseline model. Since the BHH always assigns a gene to the COG group with best hits, the recall of BHH stays constant. (BHH also has the same recall and precision since it assigns a gene to only one COG.) The benefit of our method comes from the fact that the gene pattern can be used as a tool in finding the correct orthologous group if a gene can be grouped into multiple groups, i.e., the gene has high similar with genes in multiple groups. When K = 1, each gene is assigned to one orthologous group. Thus, there is no benefit of our method. However, with a larger K, i.e., a gene may be associated with multiple groups, our method can prune out the incorrect group assignment. For instance, if gene g has higher similarity with genes in group A than genes in group B, the BBH will assign A to g. However, g's true annotation may be B. In this case, the gene pattern may be able to correctly assign g to B based on the location of gene g.

Bottom Line: These algorithms use the optimization problem formulation which is solved using the dynamic programming technique.In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information.The discovered gene patterns can be used for the detecting of ortholog and genes that collaborate for a common biological function.

View Article: PubMed Central - HTML - PubMed

Affiliation: EECS, Case Western Reserve University, Cleveland, OH 44106 USA. meng.hu@case.edu

ABSTRACT

Background: Mining gene patterns that are common to multiple genomes is an important biological problem, which can lead us to novel biological insights. When family classification of genes is available, this problem is similar to the pattern mining problem in the data mining community. However, when family classification information is not available, mining gene patterns is a challenging problem. There are several well developed algorithms for predicting gene patterns in a pair of genomes, such as FISH and DAGchainer. These algorithms use the optimization problem formulation which is solved using the dynamic programming technique. Unfortunately, extending these algorithms to multiple genome cases is not trivial due to the rapid increase in time and space complexity.

Results: In this paper, we propose a novel algorithm for mining gene patterns in more than two prokaryote genomes using interchangeable sets. The basic idea is to extend the pattern mining technique from the data mining community to handle the situation where family classification information is not available using interchangeable sets. In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information. To examine the effectiveness of gene patterns further, we propose an ortholog prediction method based on our gene pattern mining algorithm and compare our method to the bi-directional best hit (BBH) technique in terms of COG orthologous gene classification information. The experiment show that our algorithm achieves a 3% increase in recall compared to BBH without sacrificing the precision of ortholog detection.

Conclusion: The discovered gene patterns can be used for the detecting of ortholog and genes that collaborate for a common biological function.

Show MeSH