Limits...
Swiftly computing center strings.

Hufsky F, Kuchenbecker L, Jahn K, Stoye J, Böcker S - BMC Bioinformatics (2011)

Bottom Line: Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied.Finally, we present results of an evaluation study for two different data sets from a biological application.Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions.

View Article: PubMed Central - HTML - PubMed

Affiliation: Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, Jena, Germany. franziska.hufsky@uni-jena.de.

ABSTRACT

Background: The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete.

Results: In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application.

Conclusions: We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = dopt -1 and d = dopt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.

Show MeSH

Related in: MedlinePlus

Percentage of trivially solved positions for d = dopt. Percentage of trivially solved positions for d = dopt plotted against the dopt /L' ratio of the instances for the first data set (full dots) and the second data set (empty dots). Dots represent individual instances, the solid line is average percentage for intervals of width 0.05.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3108310&req=5

Figure 3: Percentage of trivially solved positions for d = dopt. Percentage of trivially solved positions for d = dopt plotted against the dopt /L' ratio of the instances for the first data set (full dots) and the second data set (empty dots). Dots represent individual instances, the solid line is average percentage for intervals of width 0.05.

Mentions: The second advantage of our method is the computation of positions that can be trivially solved during preprocessing. The percentage of fixed positions is high for the important case d = dopt. In fact, for the first data set an average of 56.2% of the positions were fixed for these instances during preprocessing, and 31.7% for the second data set. Recall that MismatchCount and the algorithm of Ma and Sun work on these reduced instances. The number of solved positions depends on the dopt /L' ratio of the instance, since at least L' - 2dopt are fixed if a string pair with distance 2dopt exists, and decreases with increasing dopt /L' (Figure 3). If we use dopt /L' as a measure for the hardness of the instance, the difference between the two data sets is obvious.


Swiftly computing center strings.

Hufsky F, Kuchenbecker L, Jahn K, Stoye J, Böcker S - BMC Bioinformatics (2011)

Percentage of trivially solved positions for d = dopt. Percentage of trivially solved positions for d = dopt plotted against the dopt /L' ratio of the instances for the first data set (full dots) and the second data set (empty dots). Dots represent individual instances, the solid line is average percentage for intervals of width 0.05.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3108310&req=5

Figure 3: Percentage of trivially solved positions for d = dopt. Percentage of trivially solved positions for d = dopt plotted against the dopt /L' ratio of the instances for the first data set (full dots) and the second data set (empty dots). Dots represent individual instances, the solid line is average percentage for intervals of width 0.05.
Mentions: The second advantage of our method is the computation of positions that can be trivially solved during preprocessing. The percentage of fixed positions is high for the important case d = dopt. In fact, for the first data set an average of 56.2% of the positions were fixed for these instances during preprocessing, and 31.7% for the second data set. Recall that MismatchCount and the algorithm of Ma and Sun work on these reduced instances. The number of solved positions depends on the dopt /L' ratio of the instance, since at least L' - 2dopt are fixed if a string pair with distance 2dopt exists, and decreases with increasing dopt /L' (Figure 3). If we use dopt /L' as a measure for the hardness of the instance, the difference between the two data sets is obvious.

Bottom Line: Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied.Finally, we present results of an evaluation study for two different data sets from a biological application.Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions.

View Article: PubMed Central - HTML - PubMed

Affiliation: Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, Jena, Germany. franziska.hufsky@uni-jena.de.

ABSTRACT

Background: The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete.

Results: In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application.

Conclusions: We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = dopt -1 and d = dopt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.

Show MeSH
Related in: MedlinePlus