Limits...
An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases.

Karim MR, Rashid MM, Jeong BS, Choi HJ - Genomics Inform (2012)

Bottom Line: The challenge is to find longer sequences without specifying sequence lengths in advance.In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets.The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

No MeSH data available.


Related in: MedlinePlus

Memory usage w.r.t. change of minimum support (bacteria genome sequence). MCFS, Maximal Contiguous Frequent Suffix tree algorithm.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC3475483&req=5

Figure 7: Memory usage w.r.t. change of minimum support (bacteria genome sequence). MCFS, Maximal Contiguous Frequent Suffix tree algorithm.

Mentions: We also compared the memory usage of the three approaches with various values of minimum support. The search space was relatively smaller, because we made use of sub-sequence and super-sequence relationships, and whenever reaching to the minimum support threshold, the sub-sequence for contiguous frequent patterns was not searched. Figs. 6 and 7 show the memory usage, indicating that our approach shows relatively low memory usage compared to the other two. Although both MacosVSpan and our MCFS algorithm process one PDB after another and then produce the maximal contiguous frequent patterns by traversing the suffix tree, the size of the PDB cannot be larger than the original database (according to our proposed lemma); hence, the PDBs can be fit in the memory. This is why MCFS consumes much less memory compared to MacosVSpan.


An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases.

Karim MR, Rashid MM, Jeong BS, Choi HJ - Genomics Inform (2012)

Memory usage w.r.t. change of minimum support (bacteria genome sequence). MCFS, Maximal Contiguous Frequent Suffix tree algorithm.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC3475483&req=5

Figure 7: Memory usage w.r.t. change of minimum support (bacteria genome sequence). MCFS, Maximal Contiguous Frequent Suffix tree algorithm.
Mentions: We also compared the memory usage of the three approaches with various values of minimum support. The search space was relatively smaller, because we made use of sub-sequence and super-sequence relationships, and whenever reaching to the minimum support threshold, the sub-sequence for contiguous frequent patterns was not searched. Figs. 6 and 7 show the memory usage, indicating that our approach shows relatively low memory usage compared to the other two. Although both MacosVSpan and our MCFS algorithm process one PDB after another and then produce the maximal contiguous frequent patterns by traversing the suffix tree, the size of the PDB cannot be larger than the original database (according to our proposed lemma); hence, the PDBs can be fit in the memory. This is why MCFS consumes much less memory compared to MacosVSpan.

Bottom Line: The challenge is to find longer sequences without specifying sequence lengths in advance.In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets.The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

No MeSH data available.


Related in: MedlinePlus