Limits...
Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs.

Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, Behrens TE, Sotiropoulos SN - PLoS ONE (2013)

Bottom Line: With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands.We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version.We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science, University of Murcia, Murcia, Spain. moises.hernandez@um.es

ABSTRACT
With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can execute simultaneously thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity, non-invasively and in-vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

Show MeSH
Typical NVIDIA GPU architecture.The GPU is comprised of a set of Streaming MultiProcessors (SM). Each SM is comprised of several Stream Processor (SP) cores, as shown for the NVIDIA’s Fermi architecture (a). The GPU resources are controlled by the programmer through the CUDA programming model, shown in (b).
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC3643787&req=5

pone-0061892-g001: Typical NVIDIA GPU architecture.The GPU is comprised of a set of Streaming MultiProcessors (SM). Each SM is comprised of several Stream Processor (SP) cores, as shown for the NVIDIA’s Fermi architecture (a). The GPU resources are controlled by the programmer through the CUDA programming model, shown in (b).

Mentions: All NVIDIA GPU platforms from the G80 architecture may be programmed using the CUDA programming model [24], which makes the GPU operate as a highly parallel computing device. Each GPU device is a scalable processor array consisting of a set of SIMT (Single Instruction Multiple Threads) [4], [35] Streaming Multiprocessors (SM), each containing several Stream Processors (SPs) (see Figure 1(a)). The GPU has a global scheduler (Giga Thread) for distributing the work to the SMs and a host interface. Different memory spaces are also available within a GPU, having different latencies, storage capacity and access methods. These memory spaces, ordered from low to high latency are: the register file (32768 32-bit registers per SM in NVIDIA compute capability devices 2.X), the shared memory/L1 cache (64 KB per SM), the L2 cache (768 KB) and the global memory (DRAM, 1 - 6 GB).


Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs.

Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, Behrens TE, Sotiropoulos SN - PLoS ONE (2013)

Typical NVIDIA GPU architecture.The GPU is comprised of a set of Streaming MultiProcessors (SM). Each SM is comprised of several Stream Processor (SP) cores, as shown for the NVIDIA’s Fermi architecture (a). The GPU resources are controlled by the programmer through the CUDA programming model, shown in (b).
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC3643787&req=5

pone-0061892-g001: Typical NVIDIA GPU architecture.The GPU is comprised of a set of Streaming MultiProcessors (SM). Each SM is comprised of several Stream Processor (SP) cores, as shown for the NVIDIA’s Fermi architecture (a). The GPU resources are controlled by the programmer through the CUDA programming model, shown in (b).
Mentions: All NVIDIA GPU platforms from the G80 architecture may be programmed using the CUDA programming model [24], which makes the GPU operate as a highly parallel computing device. Each GPU device is a scalable processor array consisting of a set of SIMT (Single Instruction Multiple Threads) [4], [35] Streaming Multiprocessors (SM), each containing several Stream Processors (SPs) (see Figure 1(a)). The GPU has a global scheduler (Giga Thread) for distributing the work to the SMs and a host interface. Different memory spaces are also available within a GPU, having different latencies, storage capacity and access methods. These memory spaces, ordered from low to high latency are: the register file (32768 32-bit registers per SM in NVIDIA compute capability devices 2.X), the shared memory/L1 cache (64 KB per SM), the L2 cache (768 KB) and the global memory (DRAM, 1 - 6 GB).

Bottom Line: With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands.We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version.We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

View Article: PubMed Central - PubMed

Affiliation: Department of Computer Science, University of Murcia, Murcia, Spain. moises.hernandez@um.es

ABSTRACT
With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can execute simultaneously thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity, non-invasively and in-vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation.

Show MeSH