Tutorial at HiPEAC2021:
Instrumentation and Modeling of Performance and Power Consumption for Massively Parallel Processors
In conjunction with HiPEAC Conference 2021, Budapest, Hungary, January 19, 2021
GPUs have established themselves as accelerators for various applications, including scientific-technical computing, data analysis, and machine learning. However, exploiting their maximum performance is not a trivial task: deep knowledge of the CUDA-specific ecosystem and GPU architecture is needed, and the GPU's nature as a massively parallel processor adds to this complexity. Additionally, heterogeneous computing has to be considered, as GPUs are not a full replacement for CPUs as general-purpose processors. Furthermore, a variety of different GPUs exists, differing both in generation and in the variants within a given generation. For a given application, selecting the best processor is intuitive for neither provisioning nor scheduling.
For a better understanding of the performance and power implications of code variants and optimizations, instrumentation can be helpful, as it is more selective than simply relying on (hardware) performance counters. Furthermore, performance counters are often limited in coverage and grouping. The characterization obtained by such instrumentation can therefore be of substantial help to programmers, but it can also be the basis for predictive modeling of performance and power consumption. If such predictions are fast and portable across different GPUs, they are useful for various provisioning and scheduling tasks, and in general for reasoning about optimality in heterogeneous systems.
In this tutorial we present our current research on the methodology and tooling of instrumentation as well as performance and power prediction for GPUs. For efficient instrumentation, we present CUDA Flux (https://github.com/UniHD-CEG/cuda-flux), which is open source and based on the LLVM tool stack. For prediction of execution time and power consumption, we present a tool named GPU Mangrove (https://github.com/UniHD-CEG/gpu-mangrove), which is fast, accurate, portable and publicly available. As the model relies only on hardware-independent features, the workflow can be easily ported to any type of GPU. The tutorial program aims to provide attendees with practical experience in the instrumentation and modeling of performance and power consumption for GPUs.
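To illustrate the modeling idea, the following is a minimal sketch (not the actual GPU Mangrove pipeline) of training a Random Forest to predict kernel execution time from hardware-independent features such as per-kernel instruction counts. The feature set and the data here are entirely hypothetical, for illustration only.

```python
# Illustrative sketch, NOT the GPU Mangrove implementation: predict kernel
# execution time from hardware-independent features with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features per kernel launch: counts of floating-point,
# integer and memory instructions, plus the number of threads launched.
n_kernels = 200
X = rng.integers(100, 10_000, size=(n_kernels, 4)).astype(float)

# Synthetic ground truth: execution time grows with instruction volume,
# with memory instructions weighted most heavily, plus measurement noise.
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 2.0 * X[:, 2] + 0.01 * X[:, 3]
y += rng.normal(scale=5.0, size=n_kernels)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Because the features are hardware-independent, the same feature vectors
# can be reused to train a model for a different GPU given new timings.
pred = model.predict(X[:5])
```

In this sketch, retargeting the model to another GPU only requires re-measuring the targets (time or power) on that device; the features themselves do not change.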
The first part of this tutorial will focus on an introduction to these tools, combined with some background information on topics such as training and validation, GPU architecture, and GPU programming. The second half of this tutorial will provide attendees with a prepared environment to experiment with each step, from instrumentation through model building to validation. More details about the schedule are to follow soon.
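The validation step in the hands-on part can be sketched as follows: hold out kernels unseen during training and report the mean absolute percentage error (MAPE), a common metric for time and power prediction. Again, the features and data below are synthetic placeholders, not the tutorial's actual dataset.

```python
# Hedged sketch of model validation on held-out kernels, with synthetic
# hardware-independent features and a synthetic execution-time target.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.integers(100, 10_000, size=(300, 4)).astype(float)
y = X @ np.array([0.5, 0.2, 2.0, 0.01]) + rng.normal(scale=5.0, size=300)

# Hold out 25% of the kernels for validation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
# Mean absolute percentage error over the held-out kernels.
mape = float(np.mean(np.abs(pred - y_te) / np.abs(y_te)))
```

A hold-out or cross-validation split like this is what separates a usable predictor from one that merely memorizes the training kernels.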
While there exists a plethora of related work on performance and power modeling, to the best of our knowledge GPU Mangrove is one of the very few tools that are publicly available, and it furthermore distinguishes itself by being fast (Random Forest models) and portable (relying solely on hardware-independent features). With this tutorial we aim to make the community aware of the methodology and our publicly available tools, and to create an understanding of the possibilities and limitations of instrumentation and modeling. We believe such content to be of value across different areas in the community, ranging from embedded systems to high-performance computing clusters.
Details to follow.
Chen Song, Interdisciplinary Research Center for Scientific Computing (IWR), Heidelberg University & Heidelberg Institute for Theoretical Studies (HITS) (chen.song(at)iwr.uni-heidelberg.de)
Lorenz Braun, Computing Systems Group, Institute for Computer Engineering (ZITI), Heidelberg University (lorenz.braun(at)ziti.uni-heidelberg.de)
Holger Fröning, Computing Systems Group, Institute for Computer Engineering (ZITI), Heidelberg University (holger.froening(at)ziti.uni-heidelberg.de)
Lorenz Braun, Holger Fröning, CUDA Flux: A Lightweight Instruction Profiler for CUDA Applications, Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS19), held as part of ACM/IEEE Supercomputing 2019 (SC19), Denver, CO, USA. [link] [github]
Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning, A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels. arXiv:2001.07104 [cs], Jan. 2020. http://arxiv.org/abs/2001.07104 [github]