GPU Acceleration
Repurposing GPUs
NVIDIA's CUDA architecture for GPU "cores" has gone through a number of important revisions: floating-point units now meet IEEE standards, with native 32-bit precision and improved 64-bit support; ECC memory safeguards against memory faults; on-die cache eliminates many system-memory lookups; improved scheduling of concurrent kernels maintains a high load (think hyperthreading); and fully programmable streaming processors support C/C++-style programming. Users no longer need to rewrite their calculations in terms of native geometry commands. And of course this list keeps growing as NVIDIA engineers evolve their technology.
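In practice this means a calculation can be written as an ordinary C/C++ function (a kernel) and launched across thousands of threads, rather than being disguised as texture or geometry operations. Below is a minimal, self-contained sketch of that programming model; it is a generic example (a scaled vector addition), not code taken from NAMD or any other MD package.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel is an ordinary C/C++ function run by many threads at once.
// Each thread handles one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device data
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected 5.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```

Each thread computes its own global index and handles one array element; the same pattern scales from a handful of elements to millions without changing the kernel.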
The traditional approach to CPU design is to accelerate individual tasks heavily, packing as many operations as possible into each clock cycle. To do so, CPUs rely on large caches and house up to 16 cores (and growing) on a die running at 2-4 GHz; they still offer superior single-threaded performance for nonparallel tasks. A GPU, traditionally designed for rendering games, is inherently parallel and synchronized: it can house anywhere from 300 to 600 streaming processors running at 400-1000 MHz, offering high throughput but with simpler and fewer operations per clock. Several MD programs have therefore focused on offloading intensive, parallel tasks to the GPU. For molecular modeling of biological membranes, the nonbonded force calculation is the focus of GPU acceleration, as sketched below.
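As a rough illustration of why the nonbonded calculation maps so well onto a GPU, here is a toy all-pairs Lennard-Jones force kernel with one thread per atom. It is only a sketch under simplifying assumptions: no cutoff, no neighbor lists, no exclusions, and no electrostatics/PME, all of which a production engine such as NAMD handles; the function and parameter names are ours, not NAMD's.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy all-pairs Lennard-Jones force kernel: one thread per atom i,
// looping over every other atom j. Assumes no two atoms coincide.
__global__ void lj_forces(int n, const float4 *pos, float4 *force,
                          float epsilon, float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    float sig2 = sigma * sigma;

    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;

        // Scalar prefactor: F(r)/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
        float sr2 = sig2 / r2;
        float sr6 = sr2 * sr2 * sr2;
        float fscale = 24.f * epsilon * (2.f * sr6 * sr6 - sr6) / r2;

        fx += fscale * dx;
        fy += fscale * dy;
        fz += fscale * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}

int main() {
    const int n = 256;
    float4 *pos, *force;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)                // atoms on a crude 1D lattice
        pos[i] = make_float4(3.8f * i, 0.f, 0.f, 0.f);

    // Argon-like LJ parameters (kcal/mol, Angstrom), purely illustrative.
    int threads = 128, blocks = (n + threads - 1) / threads;
    lj_forces<<<blocks, threads>>>(n, pos, force, 0.24f, 3.4f);
    cudaDeviceSynchronize();

    printf("force on atom 0: (%g, %g, %g)\n",
           force[0].x, force[0].y, force[0].z);
    cudaFree(pos); cudaFree(force);
    return 0;
}
```

Because every thread performs the same arithmetic on independent data, hundreds of streaming processors can be kept busy at once, which is exactly the workload profile described above.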
GPU acceleration benchmarks
On our CUDA workstation we benchmarked an 80k-atom lipid bilayer simulation in NAMD across different combinations of GPUs and CPU threads. In the table below, a run with G GPUs and T threads is written G-T, and speedups are quoted relative to the 1-1, 2-2, and 4-4 runs.
| GPUs | threads | speedup vs 1-1 | speedup vs 2-2 | speedup vs 4-4 |
|------|---------|----------------|----------------|----------------|
| 2    | 2       | 1.97           |                |                |
| 2    | 4       | 3.78           | 1.92           |                |
| 2    | 6       | 4.91           | 2.50           |                |
| 2    | 8       | 5.36           | 2.72           |                |
| 2    | 10      | 5.57           | 2.83           |                |
| 2    | 12      | 6.40           | 3.26           |                |
| 2    | 14      | 5.84           | 2.97           |                |
| 2    | 16      | 5.85           | 2.97           |                |
| 3    | 3       | 2.82           |                |                |
| 4    | 4       | 3.62           | 1.84           | -              |
| 4    | 6       | 5.23           | 2.66           | 1.45           |
| 4    | 8       | 6.73           | 3.42           | 1.86           |
| 4    | 10      | 7.38           | 3.75           | 2.04           |
| 4    | 12      | 7.79           | 3.96           | 2.16           |
| 4    | 14      | 7.99           | 4.07           | 2.21           |
| 4    | 16      | 8.11           | 4.12           | 2.24           |
Note that, as a hardware limitation, there is reduced bandwidth between the four GPU cards that is not present in the two-card benchmarks. NAMD responds to both threading and raw GPU resources, but it also shows diminishing returns: threading increases utilization, yet it also increases overhead. The negative effect is visible in the two-GPU runs, where the speedup peaks at 6.40 with 12 threads and falls to about 5.85 with 14 and 16 threads.
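One way to think about the flattening speedup curve is an Amdahl's-law-style model: if only a fraction of each timestep parallelizes over threads and each extra thread adds a little coordination overhead, the speedup saturates and can even dip. The sketch below uses made-up parameters for illustration only; it is not a fit to the benchmark table. It is host-only code and compiles with either nvcc or a plain C++ compiler.

```cuda
#include <cstdio>

// Toy model: speedup(n) = 1 / ((1 - p) + p/n + c*n),
// where p is the assumed parallel fraction and c*n is a crude
// per-thread coordination overhead. p and c are illustrative values.
int main() {
    const double p = 0.95;    // assumed parallel fraction
    const double c = 0.004;   // assumed per-thread overhead
    for (int n = 2; n <= 16; n += 2) {
        double s = 1.0 / ((1.0 - p) + p / n + c * n);
        printf("%2d threads -> modeled speedup %.2f\n", n, s);
    }
    return 0;
}
```

With these toy numbers the modeled speedup climbs quickly at low thread counts and then levels off near 16 threads, the same qualitative shape the two-GPU benchmarks show.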