Parallelization and optimization of genetic analyses in isolation by distance web service.

Turner JL, Kelley ST, Otto JS, Valafar F, Bohonak AJ - BMC Genet. (2009)

Bottom Line: We found that these sections of code could be restructured and parallelized to improve efficiency. The fact that the website has continued to work well in "real-world" tests, and receives a considerable number of new citations, provides the strongest testimony to the effectiveness of our improvements. However, we soon expect the need to upgrade the number of nodes in our cluster significantly as dataset sizes continue to expand.


Affiliation: Department of Biology, San Diego State University, San Diego, California 92182, USA. juleigha27@gmail.com

ABSTRACT

Background: The Isolation by Distance Web Service (IBDWS) is a user-friendly web interface for analyzing patterns of isolation by distance in population genetic data. IBDWS enables researchers to perform a variety of statistical tests, such as Mantel tests and reduced major axis regression (RMA), and returns vector-based graphs. The more than 60 citations since 2005 confirm the popularity and utility of this website. Despite its usefulness, data sets with more than 65 populations can take hours or days to analyze due to the computational intensity of the statistical tests. This is especially troublesome for web-based software, since users tend to expect real-time results on the order of seconds or, at most, minutes. Moreover, as genetic data continue to increase and diversify, so does the demand for more processing power. To increase the speed and efficiency of IBDWS, we first determined which aspects of the code were most time consuming and whether they might be amenable to improvement by parallelization or algorithmic optimization.

Results: Runtime tests uncovered two areas of IBDWS that consumed significant amounts of time: randomizations within the Mantel test and the RMA calculations. We found that these sections of code could be restructured and parallelized to improve efficiency. The code was first optimized by combining two similar randomization routines, implementing a Fisher-Yates shuffling algorithm, and then parallelizing those routines. Tests of the parallelization and Fisher-Yates algorithmic improvements were performed on a variety of data sets ranging from 10 to 150 populations. All tested algorithms showed runtime reductions and a very close fit to the predicted speedups based on time-complexity calculations. In the case of 150 populations with 10,000 randomizations, data were analyzed 23 times faster.
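Because each randomization is independent, the parallelization described above amounts to splitting the permutation workload across processors and merging the resulting null distributions. The paper does not specify its parallel framework, so the sketch below is only an illustration of that idea using Python's standard `multiprocessing` module; `randomization_batch` and its placeholder statistic are hypothetical stand-ins for the real Mantel/RMA permutation work.

```python
from multiprocessing import Pool
import random

def randomization_batch(args):
    """Run one worker's share of the randomizations.

    Hypothetical stand-in: each 'statistic' here is just a random draw;
    in IBDWS each iteration would shuffle the genetic-distance matrix
    and recompute the Mantel/RMA statistics.
    """
    seed, n = args
    rng = random.Random(seed)  # independent stream per worker
    return [rng.random() for _ in range(n)]

def parallel_randomizations(n_rand=10000, workers=4):
    """Split n_rand independent randomizations across worker processes."""
    per_worker = n_rand // workers
    jobs = [(seed, per_worker) for seed in range(workers)]
    with Pool(workers) as pool:
        batches = pool.map(randomization_batch, jobs)
    # concatenate the per-worker null distributions
    return [stat for batch in batches for stat in batch]
```

Since the iterations share no state, the expected speedup is close to linear in the number of workers, which is consistent with the near-predicted speedups the authors report.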

Conclusion: Since the implementation of the new algorithms in late 2007, datasets have continued to grow substantially in size, and many now exceed the largest population sizes in our test sets. The fact that the website has continued to work well in "real-world" tests, and receives a considerable number of new citations, provides the strongest testimony to the effectiveness of our improvements. However, we expect that the number of nodes in our cluster will soon need to be increased significantly as dataset sizes continue to expand. The parallel implementation can be found at http://ibdws.sdsu.edu/.


© Copyright Policy - open-access

Figure 4: Effect of loop randomization combination by number of randomizations. Tests of combined loop randomizations with a data set of 50 populations, analyzed with 100, 1000, and 10,000 randomizations.

Mentions: Initial experiments tested the two code enhancements prior to parallelization: the combined Mantel/RMA randomization loop and Fisher-Yates shuffling. First, combining the RMA and Mantel randomization loops into a single loop was tested with the expectation that runtime should decrease while overall time complexity stays the same, at O(n). (For details on how we calculated the expected speedup improvements, see Additional File 1 – Time complexity analysis.) The runs behaved as expected across a range of population sizes and numbers of randomizations (Figures 3 and 4). As the number of populations increased, runtimes were consistently lower for the combined loop at every point (Figure 3), and also lower for 1000 and 10,000 randomizations (Figure 4).
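The two pre-parallelization enhancements can be sketched together: an O(n) Fisher-Yates shuffle, and a single randomization loop whose one shuffle per iteration feeds both null distributions instead of shuffling separately for the Mantel and RMA tests. This is a minimal illustration, not the IBDWS source; the toy test statistics below (a cross-product and a sum of squared differences) are hypothetical stand-ins for the actual Mantel and RMA calculations.

```python
import random

def fisher_yates_shuffle(values):
    """Return a uniformly random permutation of `values` in O(n) time."""
    a = list(values)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)  # pick from the not-yet-fixed prefix a[0..i]
        a[i], a[j] = a[j], a[i]
    return a

def combined_null_distributions(gen, geo, n_rand=1000):
    """One randomization loop: each shuffle feeds BOTH null distributions,
    halving the number of permutations generated versus two separate loops."""
    mantel_null, rma_null = [], []
    for _ in range(n_rand):
        perm = fisher_yates_shuffle(gen)  # single shuffle per iteration...
        # ...used by a Mantel-style statistic (stand-in: cross-product)
        mantel_null.append(sum(p * q for p, q in zip(perm, geo)))
        # ...and by an RMA-style statistic (stand-in: squared differences)
        rma_null.append(sum((p - q) ** 2 for p, q in zip(perm, geo)))
    return mantel_null, rma_null
```

Each loop iteration does constant work per population pair, so total cost stays linear in the number of randomizations; combining the loops changes only the constant factor, matching the behavior seen in Figures 3 and 4.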

