Limits...
An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases.

Karim MR, Rashid MM, Jeong BS, Choi HJ - Genomics Inform (2012)

Bottom Line: The challenge is to find longer sequences without specifying sequence lengths in advance.In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets.The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

No MeSH data available.


Related in: MedlinePlus

The MCFS algorithm. MCFS, Maximal Contiguous Frequent Suffix tree algorithm; PDB, projected database.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC3475483&req=5

Figure 1: The MCFS algorithm. MCFS, Maximal Contiguous Frequent Suffix tree algorithm; PDB, projected database.

Mentions: Our proposed algorithm (Fig. 1) proceeds in three steps. The first step scans the original database and constructs four PDBs. While creating PDBs of A, T, C, and G, a register is maintained that checks for duplicate suffixes without sorting the suffixes. If a suffix appears more than once in a PDB, the suffix is inserted into a temporary buffer with support. If a suffix X has n replicas in α-PDBs, for example, we move X to the temporary buffer with support n. This kind of suffix uses very little space and thus can be stored in the memory. Then, we check the suffixes and merge the sub-sequence suffixes with super sequence suffixes according to lemma 1. In this way, the size of physical PDB does not exceed the size of the original database. Inside the PDBs, we do not register the value of sequence ID or offset, because we can directly insert suffixes into the suffix tree.


An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases.

Karim MR, Rashid MM, Jeong BS, Choi HJ - Genomics Inform (2012)

The MCFS algorithm. MCFS, Maximal Contiguous Frequent Suffix tree algorithm; PDB, projected database.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC3475483&req=5

Figure 1: The MCFS algorithm. MCFS, Maximal Contiguous Frequent Suffix tree algorithm; PDB, projected database.
Mentions: Our proposed algorithm (Fig. 1) proceeds in three steps. The first step scans the original database and constructs four PDBs. While creating PDBs of A, T, C, and G, a register is maintained that checks for duplicate suffixes without sorting the suffixes. If a suffix appears more than once in a PDB, the suffix is inserted into a temporary buffer with support. If a suffix X has n replicas in α-PDBs, for example, we move X to the temporary buffer with support n. This kind of suffix uses very little space and thus can be stored in the memory. Then, we check the suffixes and merge the sub-sequence suffixes with super sequence suffixes according to lemma 1. In this way, the size of physical PDB does not exceed the size of the original database. Inside the PDBs, we do not register the value of sequence ID or offset, because we can directly insert suffixes into the suffix tree.

Bottom Line: The challenge is to find longer sequences without specifying sequence lengths in advance.In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets.The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Engineering, College of Electronics and Information, Kyung Hee University, Yongin 446-701, Korea.

ABSTRACT
Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

No MeSH data available.


Related in: MedlinePlus