GPU offloading with OpenMP

A quick guide outlining the most important aspects of GPU offloading with OpenMP in an OpenFOAM context.

In this optional section, we’ll explore the use of GPUs with OpenFOAM and OpenMP.

👉 This article is incomplete, and it only supports NVIDIA GPUs.

Base benchmarks

The OpenMP-OpenFOAM-benchmarks repository provides a set of quick benchmarks to showcase the possible benefits of using OpenMP constructs on OpenFOAM loops. The test case is a simple computation pattern that is popular in meshless methods where a kernel function is usually used as a weight for a set of sample points to approximate a target value at a particular position.
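To make the computation pattern concrete, here is a minimal serial sketch in plain C++. It is not taken from the benchmark repository: the Gaussian kernel, the normalization by the sum of weights, the 1D sample positions, and the names gaussianKernel and kernelApproximation are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical Gaussian kernel with smoothing length h (an assumption;
// the benchmark repository may use a different kernel function).
inline double gaussianKernel(double r, double h)
{
    return std::exp(-(r*r)/(h*h));
}

// Approximate a target value at position x as the kernel-weighted
// average of the sample values (normalization is an assumption).
double kernelApproximation
(
    const std::vector<double>& positions,
    const std::vector<double>& values,
    double x,
    double h
)
{
    assert(positions.size() == values.size());

    double weightedSum = 0.0;
    double weightSum = 0.0;

    // One independent kernel evaluation per sample point: this is the
    // kind of loop that OpenMP constructs can parallelize.
    for (std::size_t i = 0; i < positions.size(); ++i)
    {
        const double w = gaussianKernel(std::abs(x - positions[i]), h);
        weightedSum += w*values[i];
        weightSum += w;
    }

    // Guard against an empty sample set
    return weightSum > 0.0 ? weightedSum/weightSum : 0.0;
}
```

Each loop iteration only reads its own sample and contributes to two running sums, which is why the pattern maps well onto a parallel loop with a reduction.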

The benchmarks show the following speedups on GitHub machines (you can always consult the reports from CI jobs for updated information):

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------

Original looping                               100             1    856.051 ms
                                        8.60526 ms    8.58056 ms    8.63656 ms
                                        141.078 us    116.117 us    201.586 us

CPU openMP looping                             100             1    471.849 ms
                                        4.72523 ms    4.70228 ms    4.77649 ms
                                        166.204 us    94.5203 us     334.01 us

The performance benefits of adding #pragma omp parallel for (compare the implementations of openMPLoop() and originalLoop()) come basically for free: in the run above, the mean time drops from about 8.6 ms to about 4.7 ms, a speedup of roughly 1.8x. All you have to do is

GitHub machines provide no accelerators, so the code that is meant to be offloaded to the GPU simply falls back to running on the CPU.
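The sketch below illustrates this fallback under stated assumptions. It is not the repository's code: availableDevices, hostWeightedSum and offloadedWeightedSum are hypothetical names, and a plain weighted sum stands in for the benchmark loop. The first loop uses a host-only parallel for; the second uses a target region, which, with the default OpenMP offload policy, runs on the host when no device is available.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of offload devices seen by the OpenMP runtime
// (reported as 0 when the code is compiled without OpenMP).
inline int availableDevices()
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// Host-only variant: the loop is shared among CPU threads.
double hostWeightedSum(const std::vector<double>& w, const std::vector<double>& v)
{
    assert(w.size() == v.size());
    const int n = static_cast<int>(w.size());
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        sum += w[i]*v[i];
    }
    return sum;
}

// Offloaded variant: the arrays are mapped to the device explicitly.
// Without an accelerator the target region executes on the host.
double offloadedWeightedSum(const std::vector<double>& w, const std::vector<double>& v)
{
    assert(w.size() == v.size());
    const int n = static_cast<int>(w.size());
    if (n == 0) return 0.0;

    // Raw pointers are needed for the array sections in the map clauses
    const double* wp = w.data();
    const double* vp = v.data();
    double sum = 0.0;

    #pragma omp target teams distribute parallel for \
        map(to: wp[0:n], vp[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        sum += wp[i]*vp[i];
    }
    return sum;
}
```

Checking availableDevices() before timing a run is a cheap way to tell whether a measurement actually reflects GPU execution or the CPU fallback.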

A few tasks are left to the reader:

Run the benchmarks locally and evaluate how well GPU offloading performs on your system.
Add -O0 to Make/options to disable compiler optimizations, such as evaluating values at compile time, and observe the difference in speedup.
Adapt openmpTests.C to work for vector operations too.
Increase the number of field elements (for example, tenfold) and see what happens!
