Efficient mining of interesting patterns in large biological sequences.

Rashid MM, Karim MR, Jeong BS, Choi HJ - Genomics Inform (2012)

Bottom Line: So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. Experimental results show that our approach can find interesting patterns within an acceptable computation time.


Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences has been a major measure for determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interestingness measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.




Figure 9: (A, B) Impact of confidence threshold on mining time.

Mentions: The fourth experiment studied the impact of varying the minimum confidence from 0.2 to 0.5 for random datasets and from 0.25 to 0.55 for real datasets. This time, we set min_in_gain to 0.3 and 0.4 for random and real datasets, respectively. Fig. 9 shows the performance of our proposed approach for different values of minimum confidence and illustrates a non-linear effect. The greatest change along the confidence axis occurs from 0.2 to 0.3 for random datasets and from 0.25 to 0.35 for real datasets. In most cases, the characters are evenly distributed: A, C, G, and T each occur with a frequency of approximately 25% in the dataset. So, if min_conf is set to 0.3, most random patterns will be filtered out, and real patterns occurring more frequently than 25% of the time will survive and be extended.
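
The filtering effect described above can be illustrated with a minimal sketch. This excerpt does not define confidence, so the code assumes it is the ratio of the occurrence count of an extended pattern to the occurrence count of its prefix. The function names (count_occurrences, extension_confidence, surviving_extensions) and the toy sequences are hypothetical and are not taken from the paper.

```python
def count_occurrences(seq, pattern):
    """Count occurrences of pattern in seq, allowing overlaps."""
    n, m = len(seq), len(pattern)
    return sum(1 for i in range(n - m + 1) if seq[i:i + m] == pattern)


def extension_confidence(seq, prefix, base):
    """Assumed definition: count(prefix + base) / count(prefix)."""
    prefix_count = count_occurrences(seq, prefix)
    if prefix_count == 0:
        return 0.0
    return count_occurrences(seq, prefix + base) / prefix_count


def surviving_extensions(seq, prefix, min_conf, alphabet="ACGT"):
    """Return the one-base extensions of prefix whose confidence reaches min_conf."""
    survivors = {}
    for base in alphabet:
        conf = extension_confidence(seq, prefix, base)
        if conf >= min_conf:
            survivors[base] = conf
    return survivors


# Toy "evenly distributed" sequence: every A is followed by A, C, G, and T
# equally often, so each extension of "A" has confidence exactly 0.25.
balanced = "AACAGAT" * 10
print(surviving_extensions(balanced, "A", min_conf=0.3))  # {}

# Toy sequence with an over-represented "AC" signal appended.
biased = balanced + "AC" * 20
print(surviving_extensions(biased, "A", min_conf=0.3))  # {'C': 0.5}
```

Under this assumed definition, a threshold just above 0.25 discards every extension in the balanced sequence, while the over-represented extension in the biased sequence survives. This is consistent with the sharp change reported between 0.2 and 0.3.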

