Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs.

Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, Behrens TE, Sotiropoulos SN - PLoS ONE (2013)

Bottom Line: With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.


Affiliation: Department of Computer Science, University of Murcia, Murcia, Spain. moises.hernandez@um.es

ABSTRACT
With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can simultaneously execute thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity non-invasively and in vivo. We parallelise the Bayesian inference framework for the ball & stick model, as implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we use the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

pone-0061892-g005: Workflow in the MCMC kernel. (a) Workflow for a single iteration and a single parameter update, describing how computation tasks are distributed between the threads of a block () in a case with  gradient directions. The calculation of the model-predicted signals for the different gradient directions is distributed as evenly as possible between the threads within a thread block. The remaining tasks, which are not computationally demanding, are performed by a leader thread while the other threads wait. (b) Workflow for a thread block of the MCMC kernel that performs all T iterations for all R parameters (i.e. for a voxel). Each block has  threads. The threads need to be synchronised at certain steps.
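The work distribution in panel (a) can be illustrated with a small host-side sketch in Python. It is not the authors' CUDA kernel: the function names `partition_directions` and `block_update`, the contiguous chunking scheme, and the choice of a sum of squared residuals as the quantity the leader combines are all assumptions made for illustration. The sketch only shows the pattern the caption describes, namely an even split of the gradient directions over the threads of a block, a synchronisation point, and a cheap final step done by a single leader thread.

```python
def partition_directions(n_directions, n_threads):
    """Split direction indices into n_threads contiguous chunks.

    Chunk sizes differ by at most one, i.e. the split is as even as possible.
    (Hypothetical helper; the real kernel may use a strided layout instead.)
    """
    base, extra = divmod(n_directions, n_threads)
    chunks = []
    start = 0
    for thread_id in range(n_threads):
        size = base + (1 if thread_id < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def block_update(predict, measured, n_threads):
    """Emulate one thread block evaluating one proposed parameter value.

    predict(i) returns the model-predicted signal for gradient direction i.
    """
    chunks = partition_directions(len(measured), n_threads)

    # Parallel phase: on a GPU every thread would run this loop body
    # concurrently on its own chunk; here the threads run one after another.
    partial = [0.0] * n_threads
    for thread_id, chunk in enumerate(chunks):
        for i in chunk:
            residual = measured[i] - predict(i)
            partial[thread_id] += residual * residual

    # A real kernel needs a barrier here (e.g. __syncthreads() in CUDA)
    # so that the leader only reads finished partial results.

    # Leader phase: a single thread does the cheap remaining work.
    return sum(partial)


if __name__ == "__main__":
    measured = [1.0, 0.8, 0.6, 0.9, 0.7, 0.5, 0.4, 0.95, 0.85, 0.65]
    sizes = [len(c) for c in partition_directions(len(measured), 4)]
    print("directions per thread:", sizes)
    print("sum of squared residuals:", block_update(lambda i: 0.7, measured, 4))
```

With 10 directions and 4 threads, two threads receive three directions and two receive two, so no thread does more than one extra evaluation before the barrier.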


Mentions: Random number generation is computationally expensive. Therefore, we decided to calculate all random numbers in a separate kernel, before the MCMC kernel executes. Figures 4 and 5 show the underlying design of the MCMC kernel, which efficiently addresses (2) and (3) according to the CUDA best practices [39]. In contrast to the Levenberg-Marquardt kernel, each voxel is processed by more than one thread and is assigned to a thread block of size . This design allows many very light-weight threads ( threads) to fully occupy the hardware resources of the GPU.
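The two-stage design described above, with random numbers produced first and consumed afterwards by the sampler, can be sketched on the CPU as follows. This is a minimal illustration in Python, not the paper's kernels: the names `pregenerate_randoms` and `run_chain`, the random-walk Metropolis proposal, the fixed step size, and the toy log-posterior are all assumptions. What it shares with the text is the structure: one pass generates every random number needed, and the sampling loop then runs T iterations over R parameters without calling a generator.

```python
import math
import random


def pregenerate_randoms(n_iterations, n_params, seed=0):
    """Stage 1: draw every random number the chain will need, up front.

    For each iteration and parameter we store a (normal, uniform) pair:
    the normal drives the proposal, the uniform drives accept/reject.
    """
    rng = random.Random(seed)
    return [
        [(rng.gauss(0.0, 1.0), rng.random()) for _ in range(n_params)]
        for _ in range(n_iterations)
    ]


def run_chain(log_posterior, initial, randoms, step=0.5):
    """Stage 2: Metropolis updates, one parameter at a time.

    No random numbers are generated here; they are only read from `randoms`.
    """
    current = list(initial)
    current_lp = log_posterior(current)
    for per_param in randoms:                    # T iterations
        for r, (z, u) in enumerate(per_param):   # R parameters
            proposal = list(current)
            proposal[r] += step * z
            proposal_lp = log_posterior(proposal)
            # Accept with probability min(1, posterior ratio).
            if u < math.exp(min(0.0, proposal_lp - current_lp)):
                current, current_lp = proposal, proposal_lp
    return current


if __name__ == "__main__":
    # Toy target (an assumption): independent standard normals.
    def log_post(p):
        return -0.5 * sum(x * x for x in p)

    randoms = pregenerate_randoms(n_iterations=1000, n_params=2, seed=42)
    print(run_chain(log_post, [3.0, -3.0], randoms))
```

Separating the stages also makes a run reproducible from the seed alone, since the sampling loop is fully deterministic given the pregenerated numbers.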

