Limits...
On finding minimal absent words.

Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM - BMC Bioinformatics (2009)

Bottom Line: The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word.Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words.Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially.

View Article: PubMed Central - HTML - PubMed

Affiliation: Signal Processing Lab, DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal. ap@ua.pt

ABSTRACT

Background: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones.

Results: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws.

Conclusion: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.

Show MeSH

Related in: MedlinePlus

Number of minimal absent words for several nucleotide sequences. Plots of the number of minimal absent words, as a function of the string length, for several nucleotide sequences ("dna" curve). The number of minimal absent words for random strings with alphabet size /Σ/ = 4 is also included for comparison ("rnd" curve).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2698904&req=5

Figure 5: Number of minimal absent words for several nucleotide sequences. Plots of the number of minimal absent words, as a function of the string length, for several nucleotide sequences ("dna" curve). The number of minimal absent words for random strings with alphabet size /Σ/ = 4 is also included for comparison ("rnd" curve).

Mentions: In Table 3, we present the number of minimal absent words and the number of generic absent words, i.e., all absent words, even those composed of shorter absent words, for the same organisms used in [3]. We have adopted the notation and for designating, respectively, the number of minimal absent words and the number of generic absent words of length n associated with string S. Our method provided the same number of omers/unwords (which coincides with the number of smallest minimal absent words) reported in [3], except for the budding yeast, Saccharomyces cerevisiae, (two, instead of the reported four) and the mouse, Mus musculus, (190 instead of 192). Nevertheless, the software provided by Herold et al. [3] reported two unwords for the S. cerevisiae and 190 for the M. musculus data that we used. Figures 5 and 6 show the total number of minimal absent words and the running time for some of the DNA sequences mentioned in Table 3. For comparison, those figures also include the results obtained with random strings over an alphabet of size four. As can be seen, the total number of minimal absent words obtained with real DNA sequences is slightly less than the number obtained with random strings, whereas the time required for producing the minimal absent words is roughly identical to the time required when using random data.


On finding minimal absent words.

Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM - BMC Bioinformatics (2009)

Number of minimal absent words for several nucleotide sequences. Plots of the number of minimal absent words, as a function of the string length, for several nucleotide sequences ("dna" curve). The number of minimal absent words for random strings with alphabet size /Σ/ = 4 is also included for comparison ("rnd" curve).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2698904&req=5

Figure 5: Number of minimal absent words for several nucleotide sequences. Plots of the number of minimal absent words, as a function of the string length, for several nucleotide sequences ("dna" curve). The number of minimal absent words for random strings with alphabet size /Σ/ = 4 is also included for comparison ("rnd" curve).
Mentions: In Table 3, we present the number of minimal absent words and the number of generic absent words, i.e., all absent words, even those composed of shorter absent words, for the same organisms used in [3]. We have adopted the notation and for designating, respectively, the number of minimal absent words and the number of generic absent words of length n associated with string S. Our method provided the same number of omers/unwords (which coincides with the number of smallest minimal absent words) reported in [3], except for the budding yeast, Saccharomyces cerevisiae, (two, instead of the reported four) and the mouse, Mus musculus, (190 instead of 192). Nevertheless, the software provided by Herold et al. [3] reported two unwords for the S. cerevisiae and 190 for the M. musculus data that we used. Figures 5 and 6 show the total number of minimal absent words and the running time for some of the DNA sequences mentioned in Table 3. For comparison, those figures also include the results obtained with random strings over an alphabet of size four. As can be seen, the total number of minimal absent words obtained with real DNA sequences is slightly less than the number obtained with random strings, whereas the time required for producing the minimal absent words is roughly identical to the time required when using random data.

Bottom Line: The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word.Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words.Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially.

View Article: PubMed Central - HTML - PubMed

Affiliation: Signal Processing Lab, DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal. ap@ua.pt

ABSTRACT

Background: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones.

Results: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws.

Conclusion: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.

Show MeSH
Related in: MedlinePlus