Efficient mining of interesting patterns in large biological sequences.

Rashid MM, Karim MR, Jeong BS, Choi HJ - Genomics Inform (2012)

Bottom Line: So far, in most approaches, the number of occurrences has been the main measure for determining whether a pattern is interesting. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. Experimental results show that our approach can find interesting patterns within an acceptable computation time.

Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences has been the main measure for determining whether a pattern is interesting. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interestingness measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.

Figure 2: Surprising contiguous pattern mining algorithm.

Mentions: Fig. 2 shows our proposed algorithm, which has two steps. The first step slides a fixed-length window over the given sequences, recursively scanning each subsequence of that length to construct a fixed-length spanning tree. The database sequence ID and starting position of each pattern are stored in the corresponding leaf node of the tree as a variable-length array that is open at one end. During this scan, the algorithm also calculates the character probabilities for A, C, G, and T. This calculation is straightforward: count all occurrences of each character and divide by the total number of characters in the database. After calculating the character probabilities, we can calculate character information, pattern information, and pattern information gain by using definition 5, definition 6, and definition 7.
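
The first step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a plain dictionary keyed by the window string stands in for the fixed-length spanning tree, sequence IDs and start positions are assumed to be 0-based, and the function name `build_index` and its parameters are hypothetical. Definitions 5 to 7 are not reproduced in this excerpt, so the sketch stops at the character probabilities.

```python
from collections import Counter, defaultdict


def build_index(sequences, window):
    """Index every length-`window` substring and count characters.

    Returns (index, probs), where `index` maps each window string to a
    list of (sequence ID, start position) pairs and `probs` maps each of
    A, C, G, T to its relative frequency over the whole database.
    The dictionary is an assumed stand-in for the tree in the paper.
    """
    index = defaultdict(list)
    counts = Counter()

    for seq_id, seq in enumerate(sequences):
        # Character counts are gathered during the same pass.
        counts.update(seq)
        # Slide the fixed-length window across the sequence.
        for start in range(len(seq) - window + 1):
            index[seq[start:start + window]].append((seq_id, start))

    total = sum(counts.values())
    if total == 0:
        raise ValueError("the sequence database is empty")

    # Probability of a character = its occurrences / total characters.
    probs = {ch: counts[ch] / total for ch in "ACGT"}
    return dict(index), probs


if __name__ == "__main__":
    index, probs = build_index(["ACGT", "ACGA"], window=2)
    print(index["AC"])  # occurrences of the window "AC"
    print(probs)
```

Each list of `(sequence ID, start position)` pairs plays the role of the leaf-node array, so the support of a window can be read off from the list without rescanning the database.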

