Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs.

Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, Behrens TE, Sotiropoulos SN - PLoS ONE (2013)

Bottom Line: With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science, University of Murcia, Murcia, Spain. moises.hernandez@um.es

ABSTRACT
With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can simultaneously execute thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. Through tractography approaches, DW-MRI offers the only way to study brain structural connectivity non-invasively and in vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.
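For context, the model being fitted is the ball & stick partial-volume model (Behrens et al. 2003): the predicted signal for a measurement with b-value b and unit gradient direction g is S = S0 * [(1 - sum_j f_j) * exp(-b*d) + sum_j f_j * exp(-b*d*(g . v_j)^2)], where d is the diffusivity, f_j are the volume fractions and v_j the stick orientations. A minimal sketch of this forward model in CUDA C++ follows; the function and parameter names are ours for illustration, not identifiers from the FSL code base:

    #include <math.h>

    // Ball & stick predicted signal for one measurement (illustrative sketch,
    // not FSL code). S0: baseline signal, d: diffusivity, f: volume fractions
    // (one per stick), v: stick orientations (3 floats each), b: b-value,
    // g: unit gradient direction.
    __host__ __device__
    float ball_and_stick_signal(float S0, float d, const float* f, const float* v,
                                int num_fibres, float b, const float* g)
    {
        float iso = 1.0f;                         // fraction left for the isotropic "ball"
        float s   = 0.0f;
        for (int j = 0; j < num_fibres; ++j) {
            iso -= f[j];
            // projection of the gradient direction onto the j-th stick
            float dot = g[0]*v[3*j] + g[1]*v[3*j+1] + g[2]*v[3*j+2];
            s += f[j] * expf(-b * d * dot * dot); // anisotropic "stick" term
        }
        return S0 * (s + iso * expf(-b * d));     // add the isotropic "ball" term
    }

During MCMC, each proposed parameter set is scored by comparing predicted signals of this form against the K measured values in every voxel.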

Figure 3 (pone-0061892-g003): Distribution of resources for the CUDA kernel that performs the Levenberg-Marquardt algorithm. Voxels are assigned to threads of CUDA blocks; each block comprises a fixed number of threads and processes an equal number of voxels.

Mentions: The first CUDA kernel performs the Levenberg-Marquardt algorithm. This kernel maps each CUDA thread to a voxel (see Figure 3) and launches as many threads as there are voxels in a particular slice. Because the processing of different voxels is fully independent, the threads never need to synchronise. Notably, each thread must compute all steps of the Levenberg-Marquardt algorithm using large intermediate structures, whose size depends on the number of parameters to fit and on the number K of gradient directions in the input dataset. This involves managing many hardware resources and on-chip memories, in particular the registers, which are limited to a maximum of 63 per thread on NVIDIA's Fermi architecture used here. For input datasets with a high number K of diffusion-sensitising directions, a thread may therefore run out of registers and spill to global memory, which inevitably increases latency each time those data are read or written. The more recent Kepler architecture [38] supports up to 255 registers per thread and will potentially improve performance.
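The thread-to-voxel mapping described above can be sketched as follows; this is a hypothetical outline under our own names and signatures, not the actual FSL kernel. The __launch_bounds__ qualifier (or, alternatively, the nvcc option --maxrregcount) is one way to bound per-thread register usage when the fitting routine's intermediate structures grow with K:

    #include <cuda_runtime.h>

    // Per-voxel fitting routine; the Levenberg-Marquardt loop itself is elided.
    __device__ void levenberg_marquardt_fit(const float* y, int K, float* theta, int P)
    {
        // A real implementation iterates damped Gauss-Newton steps on the
        // ball & stick residuals for this voxel, using intermediate arrays
        // whose size grows with K and P (hence the register pressure).
        (void)y; (void)K; (void)theta; (void)P;
    }

    // One thread per voxel, as in Figure 3; threads never synchronise.
    __global__ __launch_bounds__(64)               // illustrative block size
    void fit_slice_kernel(const float* data,       // K measurements per voxel, voxel-major
                          float* params,           // P fitted parameters per voxel
                          int num_voxels, int K, int P)
    {
        int voxel = blockIdx.x * blockDim.x + threadIdx.x;
        if (voxel >= num_voxels) return;           // guard for the last partial block
        levenberg_marquardt_fit(data + (size_t)voxel * K, K,
                                params + (size_t)voxel * P, P);
    }

    // Host-side launch: as many threads as voxels in the slice.
    void fit_slice(const float* d_data, float* d_params, int num_voxels, int K, int P)
    {
        int threads = 64;
        int blocks  = (num_voxels + threads - 1) / threads;
        fit_slice_kernel<<<blocks, threads>>>(d_data, d_params, num_voxels, K, P);
    }

Capping registers this way trades register spills (to slower local/global memory) against higher occupancy; on Fermi-class hardware the 63-register ceiling makes such spills hard to avoid for large K.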

