Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs.

Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, Behrens TE, Sotiropoulos SN - PLoS ONE (2013)

Bottom Line: With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science, University of Murcia, Murcia, Spain. moises.hernandez@um.es

ABSTRACT
With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can simultaneously execute thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity non-invasively and in vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.
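To make the fitted model concrete, the sketch below computes the ball & stick predicted signal for one diffusion measurement (Behrens et al. 2003): S = S0 * [(1 - Σ f_j) * exp(-b*d) + Σ f_j * exp(-b*d*(g·v_j)²)], where f_j are fibre volume fractions and v_j the fibre orientations. It is a minimal CUDA C++ illustration; the function name, argument layout and example values are assumptions made for this sketch, not the FSL implementation.

    #include <cmath>
    #include <cstdio>

    // Ball & stick predicted signal for one measurement with b-value b and
    // gradient direction g: an isotropic "ball" compartment plus nfibres
    // anisotropic "stick" compartments with volume fractions f and
    // orientations v. (Illustrative sketch; not the FSL source.)
    __host__ __device__ float ballstick_signal(float S0, float d,
                                               const float* f, const float* v,
                                               int nfibres,
                                               float b, const float* g)
    {
        float iso = 1.0f;   // isotropic volume fraction: 1 - sum_j f_j
        float sig = 0.0f;
        for (int j = 0; j < nfibres; ++j) {
            iso -= f[j];
            // Squared alignment of the gradient with fibre j (the "stick").
            float dot = g[0]*v[3*j] + g[1]*v[3*j+1] + g[2]*v[3*j+2];
            sig += f[j] * expf(-b * d * dot * dot);
        }
        sig += iso * expf(-b * d);  // the isotropic "ball" term
        return S0 * sig;
    }

    int main()
    {
        // One fibre along z, gradient along z; b and d are illustrative values.
        float f[] = {0.6f};
        float v[] = {0.0f, 0.0f, 1.0f};
        float g[] = {0.0f, 0.0f, 1.0f};
        printf("predicted signal: %f\n",
               ballstick_signal(1.0f, 0.001f, f, v, 1, 1000.0f, g));
        return 0;
    }

During MCMC, each iteration proposes new values for d, the f_j and the v_j, re-evaluates this predicted signal against the measured data, and accepts or rejects the proposal; it is this per-voxel, per-direction workload that the GPU design parallelises.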




pone-0061892-g006: Execution times for the MCMC GPU kernel using different numbers of threads per block. Results are shown for different numbers K of gradient directions (50, 100 and 200), for a slice of 4804 voxels (3000 MCMC burn-in iterations).

Mentions: As in the case of the Levenberg-Marquardt kernel, when we choose the number of threads per block for the MCMC kernel, our target is to have simultaneously as many warps per SM as possible (to “hide” latencies from Global memory accesses). In the MCMC kernel there are additional such accesses, as the random numbers used during the MCMC are pre-generated and stored in the Global memory. If we use more threads per block than the number of gradient directions K, there will be idle threads inside the block and we will have unused resources (SPs and shared memory). If the number of threads per block and K do not divide evenly, the gradient directions will be distributed unevenly across threads. In addition, the limitations discussed in previous sections still apply, making the choice of the optimal block size less straightforward. We therefore evaluated performance for different numbers of threads per block and different numbers K of gradient directions. Figure 6 shows some illustrative examples. As can be observed, a block size of 64 threads gave, on average, the smallest execution times. We therefore chose 64 threads per block for this kernel.
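As a hedged illustration of these trade-offs, the CUDA sketch below organises the computation the way the text suggests: one block per voxel, with the block's threads striding over the K gradient directions before a shared-memory reduction. The kernel name, data layout and host driver are assumptions made for this example, not the authors' actual FSL kernel, and the Metropolis-Hastings accept/reject step that consumes the pre-generated random numbers is omitted.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sum of squared residuals over the K directions of one voxel: one block
    // per voxel, threads striding over directions. Threads with
    // threadIdx.x >= K stay idle, and when K and blockDim.x do not divide
    // evenly some threads do one more loop iteration than others -- the
    // imbalance discussed in the text above.
    __global__ void voxel_energy(const float* data,       // K measurements per voxel
                                 const float* predicted,  // K model predictions per voxel
                                 float* energy,           // one result per voxel
                                 int K)
    {
        extern __shared__ float partial[];  // one partial sum per thread
        const int voxel = blockIdx.x;

        float sum = 0.0f;
        for (int i = threadIdx.x; i < K; i += blockDim.x) {
            float r = data[voxel * K + i] - predicted[voxel * K + i];
            sum += r * r;
        }
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction; requires blockDim.x to be a power of two (e.g. 64).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) energy[voxel] = partial[0];
    }

    int main()
    {
        const int numVoxels = 2, K = 100, T = 64;  // T: threads per block
        float h_data[numVoxels * K], h_pred[numVoxels * K], h_energy[numVoxels];
        for (int i = 0; i < numVoxels * K; ++i) { h_data[i] = 1.0f; h_pred[i] = 0.5f; }

        float *d_data, *d_pred, *d_energy;
        cudaMalloc(&d_data, sizeof(h_data));
        cudaMalloc(&d_pred, sizeof(h_pred));
        cudaMalloc(&d_energy, sizeof(h_energy));
        cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
        cudaMemcpy(d_pred, h_pred, sizeof(h_pred), cudaMemcpyHostToDevice);

        // 64 threads per block: the size Figure 6 found fastest on average.
        voxel_energy<<<numVoxels, T, T * sizeof(float)>>>(d_data, d_pred, d_energy, K);
        cudaMemcpy(h_energy, d_energy, sizeof(h_energy), cudaMemcpyDeviceToHost);
        printf("energy per voxel: %f %f\n", h_energy[0], h_energy[1]);

        cudaFree(d_data); cudaFree(d_pred); cudaFree(d_energy);
        return 0;
    }

With T = 64 and K = 100, for instance, threads 0-35 each handle two directions while threads 36-63 handle one, which is precisely the kind of uneven distribution described above.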

