GPU Acceleration
Repurposing GPUs
NVIDIA's CUDA architecture for GPU "cores" has gone through a number of important revisions: floating-point units now meet IEEE standards, with native 32-bit precision and improved 64-bit support; ECC memory safeguards against memory faults; on-die cache eliminates many system-memory lookups; improved scheduling of concurrent kernels maintains a high load (think hyperthreading); and fully programmable streaming processors support C/C++-style programming. Users no longer need to rewrite their calculations in terms of native geometry commands. And of course this list keeps growing as NVIDIA engineers evolve their technology.
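In practice this means a calculation can be written as an ordinary C/C++ function (a kernel) and launched across thousands of threads, rather than being disguised as texture or geometry operations. Below is a minimal, self-contained sketch of that programming model; it is a generic example (a scaled vector addition), not code taken from NAMD or any other MD package.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel is an ordinary C/C++ function run by many threads at once.
// Each thread handles one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device data
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected 5.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```

Each thread computes its own global index and handles one array element; the same pattern scales from a handful of elements to millions without changing the kernel.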
The traditional approach to CPU design is to accelerate individual tasks heavily, packing as many operations as possible into each clock cycle. To do so, CPUs rely on large caches and house up to 16 cores (and growing) on a die running at 2-4 GHz; they still offer superior single-threaded performance for nonparallel tasks. A GPU, traditionally designed for rendering games, is inherently parallel and synchronized: it can house anywhere from 300 to 600 streaming processors running at 400-1000 MHz, offering high throughput but with simpler and fewer operations per clock. Several MD programs have therefore focused on offloading intensive, parallel tasks to the GPU. For molecular modeling of biological membranes, the nonbonded force calculation is the focus of GPU acceleration, as sketched below.
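As a rough illustration of why the nonbonded calculation maps so well onto a GPU, here is a toy all-pairs Lennard-Jones force kernel with one thread per atom. It is only a sketch under simplifying assumptions: no cutoff, no neighbor lists, no exclusions, and no electrostatics/PME, all of which a production engine such as NAMD handles; the function and parameter names are ours, not NAMD's.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy all-pairs Lennard-Jones force kernel: one thread per atom i,
// looping over every other atom j. Assumes no two atoms coincide.
__global__ void lj_forces(int n, const float4 *pos, float4 *force,
                          float epsilon, float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    float sig2 = sigma * sigma;

    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;

        // Scalar prefactor: F(r)/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
        float sr2 = sig2 / r2;
        float sr6 = sr2 * sr2 * sr2;
        float fscale = 24.f * epsilon * (2.f * sr6 * sr6 - sr6) / r2;

        fx += fscale * dx;
        fy += fscale * dy;
        fz += fscale * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}

int main() {
    const int n = 256;
    float4 *pos, *force;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)                // atoms on a crude 1D lattice
        pos[i] = make_float4(3.8f * i, 0.f, 0.f, 0.f);

    // Argon-like LJ parameters (kcal/mol, Angstrom), purely illustrative.
    int threads = 128, blocks = (n + threads - 1) / threads;
    lj_forces<<<blocks, threads>>>(n, pos, force, 0.24f, 3.4f);
    cudaDeviceSynchronize();

    printf("force on atom 0: (%g, %g, %g)\n",
           force[0].x, force[0].y, force[0].z);
    cudaFree(pos); cudaFree(force);
    return 0;
}
```

Because every thread performs the same arithmetic on independent data, hundreds of streaming processors can be kept busy at once, which is exactly the workload profile described above.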
GPU acceleration benchmarks
On our CUDA workstation we benchmarked an 80k-atom lipid bilayer simulation in NAMD across different combinations of GPUs and CPU threads. In the table below, a run with G GPUs and T threads is written G-T, and speedups are quoted relative to the 1-1, 2-2, and 4-4 runs.
| GPUs | threads | speedup vs 1-1 | speedup vs 2-2 | speedup vs 4-4 |
|------|---------|----------------|----------------|----------------|
| 2    | 2       | 1.97           |                |                |
| 2    | 4       | 3.78           | 1.92           |                |
| 2    | 6       | 4.91           | 2.50           |                |
| 2    | 8       | 5.36           | 2.72           |                |
| 2    | 10      | 5.57           | 2.83           |                |
| 2    | 12      | 6.40           | 3.26           |                |
| 2    | 14      | 5.84           | 2.97           |                |
| 2    | 16      | 5.85           | 2.97           |                |
| 3    | 3       | 2.82           |                |                |
| 4    | 4       | 3.62           | 1.84           | -              |
| 4    | 6       | 5.23           | 2.66           | 1.45           |
| 4    | 8       | 6.73           | 3.42           | 1.86           |
| 4    | 10      | 7.38           | 3.75           | 2.04           |
| 4    | 12      | 7.79           | 3.96           | 2.16           |
| 4    | 14      | 7.99           | 4.07           | 2.21           |
| 4    | 16      | 8.11           | 4.12           | 2.24           |
Note that, as a hardware limitation, there is reduced bandwidth between the four GPU cards that is not present in the two-card benchmarks. NAMD responds to both threading and raw GPU resources, but it also shows diminishing returns: threading increases utilization, yet it also increases overhead. The negative effect is visible in the two-GPU runs, where the speedup peaks at 6.40 with 12 threads and falls to about 5.85 with 14 and 16 threads.
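One way to think about the flattening speedup curve is an Amdahl's-law-style model: if only a fraction of each timestep parallelizes over threads and each extra thread adds a little coordination overhead, the speedup saturates and can even dip. The sketch below uses made-up parameters for illustration only; it is not a fit to the benchmark table. It is host-only code and compiles with either nvcc or a plain C++ compiler.

```cuda
#include <cstdio>

// Toy model: speedup(n) = 1 / ((1 - p) + p/n + c*n),
// where p is the assumed parallel fraction and c*n is a crude
// per-thread coordination overhead. p and c are illustrative values.
int main() {
    const double p = 0.95;    // assumed parallel fraction
    const double c = 0.004;   // assumed per-thread overhead
    for (int n = 2; n <= 16; n += 2) {
        double s = 1.0 / ((1.0 - p) + p / n + c * n);
        printf("%2d threads -> modeled speedup %.2f\n", n, s);
    }
    return 0;
}
```

With these toy numbers the modeled speedup climbs quickly at low thread counts and then levels off near 16 threads, the same qualitative shape the two-GPU benchmarks show.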