The GPU Mekong Project - Simplified Multi-GPU Programming
The main objective of (GPU) Mekong is to provide a simplified path to scale out the execution of GPU programs from one GPU to almost any number, independent of whether the GPUs are located within one host or distributed across a cluster or cloud. Unlike existing solutions, this work proposes to maintain the GPU’s native programming model, which relies on a bulk-synchronous, thread-collective execution; that is, no hybrid solutions like OpenCL/CUDA programs combined with message passing are required. As a result, we can maintain the simplicity and efficiency of GPU computing in the scale-out case, together with high productivity and performance.
In essence, Mekong allows for resource aggregation of compute and memory without exposing the typical programming complexities that are associated with such aggregations. Instead of having multiple GPU devices with a complex, partitioned Bulk Synchronous Parallel (BSP) domain and multiple memory resources within a partitioned Global Address Space (GAS) domain, Mekong aggregates these resources in a way such that the user only sees flat domains, while automated techniques ensure that partitioning is leveraged for improved locality.
Leveraging the beauty of data-parallel programming styles for simplified BSP and GAS aggregations
We observe that data-parallel languages like OpenCL or CUDA can greatly simplify parallel programming, so that hybrid solutions like sequential code enriched with vector instructions are not required. The inherent domain decomposition principle of these languages ensures a fine granularity when partitioning the code, typically resulting in a mapping of one single output element to one thread and reducing the need for work agglomeration. The BSP programming paradigm and its associated slackness regarding the ratio of virtual to physical processors allows effective latency hiding techniques that make large caching structures obsolete. At the same time, a typical BSP code exhibits substantial amounts of locality, as the rather flat memory hierarchy of a thread-parallel processor has to rely on large amounts of data reuse to keep its vast number of processing units busy.
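To illustrate this one-thread-per-output-element decomposition, the following sketch emulates a data-parallel kernel in plain C by iterating over the flat index space that a GPU would execute as threads. The names (`vector_add_thread`, `launch_vector_add`) are our own for illustration, not part of Mekong:

```c
#include <stddef.h>

/* Illustrative "kernel": each logical thread gid computes exactly one
 * output element, mirroring the typical OpenCL/CUDA decomposition. */
static void vector_add_thread(size_t gid, const float *a, const float *b,
                              float *c) {
    c[gid] = a[gid] + b[gid];
}

/* A BSP runtime would launch one thread per element; here we emulate
 * the whole index space with a sequential loop. Because each element
 * is independent, the index space can be split across devices freely. */
static void launch_vector_add(size_t n, const float *a, const float *b,
                              float *c) {
    for (size_t gid = 0; gid < n; ++gid)
        vector_add_thread(gid, a, b, c);
}
```

The independence of output elements is exactly what makes such kernels amenable to automated partitioning across devices.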
In the GPU Mekong project, we leverage these observations to design a compile- and run-time system that allows for programming an arbitrary number of thread-parallel processors like GPUs with a single OpenCL (future: CUDA) program. As opposed to other state-of-the-art research, the actual number of GPUs is hidden from the user at design time and during execution, allowing an easy migration from single-device to multi-device execution.
We base our approach on compilation techniques including static code analysis and code transformations of both host and device code. We initially focus on multiple GPU devices within one machine boundary (a single computer), allowing us to hide the complications of multi-device programming from the user (cudaSetDevice, streams, events, and similar). Our initial tool stack is based on OpenCL programs as input, LLVM as the compilation infrastructure, and CUDA backends to orchestrate data movement and kernel launches on any number of GPUs.
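One basic task such a runtime must automate is splitting a kernel's flat index space into per-device ranges. The following is a minimal sketch of such a partitioning step in plain C; the helper `partition_range` is our own illustration of the idea, not Mekong's actual API:

```c
#include <stddef.h>

/* Hypothetical helper: split a flat index space of n elements into
 * num_devices contiguous chunks, as a multi-GPU runtime might do when
 * distributing a partitioned BSP domain. Writes the half-open range
 * [begin, end) owned by device `dev`. */
static void partition_range(size_t n, int num_devices, int dev,
                            size_t *begin, size_t *end) {
    size_t chunk = n / (size_t)num_devices;
    size_t rem   = n % (size_t)num_devices;
    size_t d     = (size_t)dev;
    /* Spread the remainder over the first `rem` devices so chunk
     * sizes differ by at most one element. */
    *begin = d * chunk + (d < rem ? d : rem);
    *end   = *begin + chunk + (d < rem ? 1 : 0);
}
```

In the real system, the device ranges additionally drive data placement, so that each GPU holds the memory its partition of the index space touches.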
Future efforts will include support for multiple GPUs at cluster/system level, so one can leverage the availability of a large number of GPUs within a cluster, cloud or similar by programming them with a single data-parallel program.
About the name
With Mekong we are actually referring to the Mekong Delta, a huge river delta in southwestern Vietnam where one of the longest rivers of the world branches into an abundant number of distributaries before finally emptying into the South China Sea. It forms a large triangle that embraces a variety of physical landscapes, and is famous among backpackers and tourists as a travel destination.
What actually motivated us to choose Mekong as a name is the fact that a single huge stream is transformed into a large number of distributaries; an effect that we also see in our GPU project: Mekong as a project aims to transform a single data stream into a large number of smaller streams that embrace smaller islands (computational units, memory), which mostly operate independently except for interactions like data distribution, communication, and synchronization.
The Mekong project was previously called GCUDA, and you might find a few references to this old name.
About the researchers
The Mekong project was initiated by the Computing Systems Group (CSG) (formerly: Computer Engineering Group), Institute of Computer Engineering at Heidelberg University, Germany. It initially received funding in the form of a Google Faculty Research Award, and is meanwhile funded by the German Ministry for Education and Research (BMBF - FKZ: 01IH16007). For the BMBF project, the Engineering Mathematics and Computing Lab (EMCL), also from Heidelberg University, joined as a peer partner.
Holger Fröning, CSG, PI (holger.froening (at) ziti.uni-heidelberg.de)
Vincent Heuveline, EMCL, co-PI (vincent.heuveline (at) ziti.uni-heidelberg.de)
Lorenz Braun, CSG, PhD student (lorenz.braun (at) stud.uni-heidelberg.de)
Song Chen, EMCL, Post-Doc (chen.song (at) iwr.uni-heidelberg.de)
Yannick Emonds, CSG, PhD student (yannick.emonds (at) ziti.uni-heidelberg.de)
Tobias Grosser (ETHZ)
Axel Köhler, Stefan Kramer (NVIDIA Germany)
Simon Gawlok, PhD student (simon.gawlok (at) uni-heidelberg.de)
Alexander Matz, PhD student (alexander.matz (at) ziti.uni-heidelberg.de)
Sotirios Nikas, EMCL, PhD student (sotirios.nikas (at) uni-heidelberg.de)
For additional questions or comments, please contact the PI: Holger Fröning, holger.froening (at) ziti.uni-heidelberg.de.
An early prototype is available here: https://github.com/UniHD-CEG/mekong-cuda. Please note that this prototype is a work in progress and your mileage may vary.
An associated analysis tool for memory tracing is available here: https://github.com/UniHD-CEG/cuda-memtrace, which is also used for a portable GPU performance and power model (https://github.com/UniHD-CEG/gpu-mangrove).
Peer-reviewed Publications and Preprints
[TACO2021] Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning, A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels, ACM Transactions on Architecture and Code Optimization (TACO), 18(1), Article 7, January 2021. [doi][preprint] [github]
[P2S2-2020] Alexander Matz, Holger Fröning, Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation, 13th International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), in conjunction with ICPP2020, August 17, 2020, Edmonton, AB, Canada. (accepted for publication) [article] [github]
[ARXIV2020] Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning, A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels. ArXiv:2001.07104 [Cs], Jan. 2020.
[PMBS2019] Lorenz Braun, Holger Fröning, CUDA Flux: A Lightweight Instruction Profiler for CUDA Applications, Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS19), held as part of ACM/IEEE Supercomputing 2019 (SC19), Denver, CO, USA. [article] [github]
[GPGPU2019] Alexander Matz, Holger Fröning, Quantifying the NUMA Behavior of Partitioned GPGPU Applications, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, April 13, Providence, RI, USA. (acceptance rate: 40%, 6/15)
[HIPEAC2016MULTIPROG] Alexander Matz, Mark Hummel, Holger Fröning, Exploring LLVM Infrastructure for Simplified Multi-GPU Programming, Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), in conjunction with HiPEAC 2016, Prague, Czech Republic, Jan. 18, 2016. (acceptance rate 73.3%, 11/15)
Posters and other contributions
Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning, GPU Mangrove - Execution Time and Power Prediction, International Supercomputer Conference (ISC), Poster, June 2020.
Alexander Matz, Holger Fröning, Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation, Student Research Competition, International Symposium on Code Generation and Optimization (CGO18), February 2018.
Alexander Matz, Holger Fröning, GPU Mekong: Simplified Multi-GPU Programming using Automated Partitioning, International Conference for High Performance Computing, Networking, Storage, and Analysis (SC17), November 2017.
Lorenz Braun, Holger Fröning, Leveraging Code Transformations for Simplified Multi-GPU Programming, ACACES Summer School, July 2017.
Alexander Matz, Christoph Klein, Holger Fröning, GPU Mekong: Simplified Multi-GPU Programming using Automated Partitioning, NVIDIA GPU Technology Conference (GTC), Poster, May 8-11, 2017, San Jose, California, US.
Alexander Matz, Christoph Klein, Holger Fröning, Static Analysis for Automated Partitioning of Single-GPU Kernels, 2016 European LLVM Developers' Meeting. March 2016.
We gratefully acknowledge the sponsoring we have received from Google (Google Research Award, 2014) and the German Excellence Initiative, with substantial equipment grants from NVIDIA. Most recent funding is covered by the German Ministry for Education and Research (BMBF - FKZ: 01IH16007).