Efficient implementation of MrBayes on multi-GPU.

Bao J, Xia H, Zhou J, Liu X, Wang G - Mol. Biol. Evol. (2013)

Bottom Line: This article describes a(MC)³ (aMCMCMC), an efficient implementation of MrBayes (MC)³ on the Compute Unified Device Architecture (CUDA). By dynamically adjusting the task granularity to the input data size and hardware configuration, it makes full use of the GPU cores across different data sets. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup on a 32-node GPU cluster. a(MC)³ is dramatically faster than all previous (MC)³ algorithms and scales well to large GPU clusters.


Affiliation: College of Information Technical Science, Nankai University, Tianjin, China.

ABSTRACT
MrBayes, which uses Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. Although (MC)³ is a leading method for inferring phylogeny from DNA data, the Bayesian algorithm and its improved and parallel versions are currently not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes a(MC)³ (aMCMCMC), an efficient implementation of MrBayes (MC)³ on the Compute Unified Device Architecture (CUDA). By dynamically adjusting the task granularity to the input data size and hardware configuration, it makes full use of the GPU cores across different data sets. An adaptive method is also developed to split and combine DNA sequences so that a large number of GPU cards can be fully used. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizations are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup on a 32-node GPU cluster. a(MC)³ is dramatically faster than all previous (MC)³ algorithms and scales well to large GPU clusters.
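
The idea of splitting DNA sequences across GPU cards and combining the results can be illustrated with a minimal Python sketch. This is not the authors' adaptive method. It only assumes that alignment columns (sites) are scored independently, so the columns can be cut into contiguous chunks, one per GPU, and the per-chunk log-likelihoods summed afterwards. The function names `split_sites` and `combine_log_likelihoods` are hypothetical and do not come from MrBayes or a(MC)³.

```python
from typing import List, Tuple


def split_sites(n_sites: int, n_gpus: int) -> List[Tuple[int, int]]:
    """Cut the site range [0, n_sites) into n_gpus contiguous half-open chunks.

    Chunk sizes differ by at most one site, so each GPU gets a similar load.
    (Assumption: an even static split; the paper's method is adaptive.)
    """
    if n_sites < 0 or n_gpus <= 0:
        raise ValueError("n_sites must be >= 0 and n_gpus must be > 0")
    base, extra = divmod(n_sites, n_gpus)
    chunks = []
    start = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional site each.
        size = base + (1 if gpu < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def combine_log_likelihoods(partial: List[float]) -> float:
    """Sum per-chunk log-likelihoods (valid when sites are independent)."""
    return sum(partial)
```

For example, `split_sites(10, 4)` gives two chunks of three sites followed by two chunks of two sites. A real implementation would also have to weigh the data transfer cost per chunk, which this sketch ignores.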


Fig. 4. Speedup of the two fastest algorithms, n(MC)³ and a(MC)³, on 1/2/4 GPUs. The horizontal axis is the DNA sequence length for 10 taxa, and the vertical axis is the speedup compared with the “fastest known” serial algorithm, MrBayes 3.1.2.


Mentions: Speedups of the two fastest algorithms, a(MC)³ and n(MC)³, were tested on multiple GPUs (fig. 4). Because MrBayes runs eight different Markov chains simultaneously, both algorithms distribute the eight chains evenly among eight independent processes in this experiment. These eight processes are then assigned evenly to the GPU cards to implement fine-grained parallelism at the DNA subsequence level. a(MC)³ is far superior to n(MC)³ in both wall time and scalability: it achieved up to 170× speedup over the serial algorithm on a quad-GPU configuration.
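
The process layout described above can be sketched in a few lines of Python. The text only states that the eight processes are assigned evenly to the GPU cards; the round-robin rule (`rank % n_gpus`) and the function name `assign_processes_to_gpus` are assumptions made for illustration, not details taken from a(MC)³ or n(MC)³.

```python
from typing import Dict, List

N_CHAINS = 8  # from the text: eight Markov chains, one per process


def assign_processes_to_gpus(n_processes: int, n_gpus: int) -> Dict[int, List[int]]:
    """Map process ranks to GPU ids round-robin (assumed rule: rank % n_gpus)."""
    if n_processes < 0 or n_gpus <= 0:
        raise ValueError("n_processes must be >= 0 and n_gpus must be > 0")
    mapping: Dict[int, List[int]] = {gpu: [] for gpu in range(n_gpus)}
    for rank in range(n_processes):
        mapping[rank % n_gpus].append(rank)
    return mapping


if __name__ == "__main__":
    # The 1/2/4 GPU configurations reported in fig. 4.
    for n_gpus in (1, 2, 4):
        print(n_gpus, assign_processes_to_gpus(N_CHAINS, n_gpus))
```

With four GPUs each card hosts two processes, with two GPUs four, and with one GPU all eight, so the per-card load stays balanced in every configuration tested.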

