For Full-Text PDF, please login, if you are a member of IEICE,|
or go to Pay Per View on menu list, if you are a nonmember of IEICE.
Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster
Junichi OHMURA Takefumi MIYOSHI Hidetsugu IRIE Tsutomu YOSHINAGA
IEICE TRANSACTIONS on Information and Systems
Publication Date: 2011/12/01
Online ISSN: 1745-1361
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)
parallel processing, multi-core processor, GPU, computation-communication overlap,
Full Text: PDF(636.5KB)>>
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and a NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today's high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node.