GPU offloading with OpenMP
A quick guide outlining the most important aspects of GPU offloading with OpenMP in an OpenFOAM context.
In this optional section, we’ll explore the use of GPUs with OpenFOAM and OpenMP.
Base benchmarks
The OpenMP-OpenFOAM-benchmarks repository provides a set of quick benchmarks to showcase the possible benefits of using OpenMP constructs on OpenFOAM loops. The test case is a simple computation pattern that is popular in meshless methods where a kernel function is usually used as a weight for a set of sample points to approximate a target value at a particular position.
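To make the computation pattern concrete, here is a minimal sketch of a kernel-weighted estimate in plain C++. It is not taken from the benchmarks repository: the Sample struct, the Gaussian kernel, the smoothing length h and the normalisation by the sum of weights are all illustrative assumptions.

```cpp
#include <cmath>
#include <vector>

// Hypothetical 1D sample point: a position and the value it carries.
struct Sample
{
    double x;
    double value;
};

// Gaussian-type kernel with smoothing length h (an assumed, illustrative choice).
inline double gaussianKernel(double r, double h)
{
    const double q = r/h;
    return std::exp(-q*q);
}

// Normalised kernel-weighted average of the sample values at position x0:
//   sum_i W(|x0 - x_i|, h)*f_i / sum_i W(|x0 - x_i|, h)
double kernelEstimate(const std::vector<Sample>& samples, double x0, double h)
{
    double weightedSum = 0.0;
    double weightSum = 0.0;
    for (const Sample& s : samples)
    {
        const double w = gaussianKernel(std::abs(x0 - s.x), h);
        weightedSum += w*s.value;
        weightSum += w;
    }
    // Guard against an empty sample set or vanishing weights
    return weightSum > 0.0 ? weightedSum/weightSum : 0.0;
}
```

Each target position needs its own loop over the samples, and those evaluations are independent of one another, which is what makes this kind of pattern a natural candidate for parallel loops.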
The benchmarks feature the following speedups on GitHub machines (you can always consult reports from CI jobs for updated information):
benchmark name            samples       iterations    estimated
                          mean          low mean      high mean
                          std dev       low std dev   high std dev
-------------------------------------------------------------------------------
Original looping          100           1             856.051 ms
                          8.60526 ms    8.58056 ms    8.63656 ms
                          141.078 us    116.117 us    201.586 us

CPU openMP looping        100           1             471.849 ms
                          4.72523 ms    4.70228 ms    4.77649 ms
                          166.204 us    94.5203 us    334.01 us
The performance benefits of adding #pragma omp parallel for (compare the implementations of openMPLoop() and originalLoop()), roughly a 1.8x speedup in the run above (mean of 4.73 ms versus 8.61 ms), come basically for free. All you have to do is
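As an illustration of the kind of change involved, the sketch below puts a serial loop next to the same loop annotated with the pragma. The function names originalLoopSketch and openMPLoopSketch and the loop body are hypothetical stand-ins, not the repository's actual openMPLoop() and originalLoop().

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference: apply a Gaussian-type weight to every element.
// Assumes out.size() >= in.size().
void originalLoopSketch(const std::vector<double>& in, std::vector<double>& out)
{
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = std::exp(-in[i]*in[i]);
    }
}

// Same loop body; the pragma asks OpenMP to share the iterations among CPU threads.
// Each iteration writes only out[i], so no synchronisation is needed.
void openMPLoopSketch(const std::vector<double>& in, std::vector<double>& out)
{
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
    {
        out[i] = std::exp(-in[i]*in[i]);
    }
}
```

With GCC or Clang the pragma only takes effect when OpenMP is enabled at compile time (for example with -fopenmp); otherwise it is ignored and the loop simply runs serially.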
GitHub machines provide no accelerators, so the code that is supposed to run on the GPU will just run on the CPU.
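A minimal sketch of what an offloaded loop can look like, assuming raw pointers mapped with OpenMP map clauses; offloadLoopSketch and offloadDeviceCount are hypothetical names, and the loop body is illustrative rather than the repository's code. When the OpenMP runtime finds no device, the target region falls back to the host by default, which matches the behaviour described above.

```cpp
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of offload devices visible to the OpenMP runtime (0 when built without OpenMP).
int offloadDeviceCount()
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// Hypothetical offloaded loop: copy the input to the device, compute there,
// and copy the result back. Assumes out.size() >= in.size().
void offloadLoopSketch(const std::vector<double>& in, std::vector<double>& out)
{
    const double* src = in.data();
    double* dst = out.data();
    const long n = static_cast<long>(in.size());

    #pragma omp target teams distribute parallel for map(to: src[0:n]) map(from: dst[0:n])
    for (long i = 0; i < n; ++i)
    {
        dst[i] = std::exp(-src[i]*src[i]);
    }
}
```

Calling offloadDeviceCount() before timing anything is a cheap way to tell whether a measurement reflects a real accelerator or the host fallback. Actual GPU execution also needs a compiler built with offloading support and the matching vendor-specific flags, which are not covered here.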
A few tasks are left to the reader:
- Try to run the benchmarks locally and evaluate how good the GPU offloading is on your system.
- Add -O0 to Make/options to turn off compiler optimizations (including the filling-in of compile-time values) and observe the difference in speedup.
- Adapt openmpTests.C to work for vector operations too.
- Increase the number of field elements (say, tenfold) and see what happens!