Non-Blocking P2P comms

Non-blocking point-to-point communication in OpenFOAM: a more efficient approach that overlaps communication with computation!

Lecture video



Module transcript

In the previous module, we talked about an inherent issue with blocking comms: the possibility of a deadlock.

A radical solution to this problem, one that does not require a specific order of send and receive operations, is for the MPI calls to return without actually sending the data. For this type of comms, OpenFOAM uses Pstream::nonBlocking.


The call simply returns immediately after initiating the non-blocking operation and expects the program to wait for the operation to complete elsewhere in the code.

Remember that we can't move the car unless we check that the tyres are in place, but we can do other things in the meantime. In this case, the process can do unrelated computations while waiting for the communication to finish. So it's a form of a parallel pipeline.


Note that doing so completely removes the risk of deadlocking while also reducing the time processes spend waiting. It also helps avoid unnecessary synchronization between processes: if each process can work independently on its data, there is no need to block until the data is synchronized.
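To make the idea concrete, here is a minimal sketch of the underlying MPI pattern (plain MPI, not OpenFOAM code): each rank posts an immediate send and receive to a hypothetical ring neighbor, does some unrelated work, and only waits for the requests when the exchanged data is actually needed. The neighbor choice, buffer size, and tag are illustrative assumptions.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nProcs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

        const int n = 1000;
        std::vector<double> sendBuf(n, rank);   // data to ship out
        std::vector<double> recvBuf(n);         // space for the neighbor's data

        const int next = (rank + 1) % nProcs;            // hypothetical ring neighbors
        const int prev = (rank - 1 + nProcs) % nProcs;

        MPI_Request reqs[2];

        // Both calls return immediately, so the order of sends and receives
        // across processes no longer matters and no deadlock can occur.
        MPI_Isend(sendBuf.data(), n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvBuf.data(), n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

        // ... unrelated computation can run here while the messages are in flight ...

        // Block only at the point where the exchanged data is actually needed
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }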


Let's look at a short code snippet where many processes communicate with each other in pairs, that is, in a point-to-point fashion.

On the second line, we create a local parallel buffers object. Note that all processes will execute this code, so each process creates its own parallel buffers object, running in non-blocking mode as we can see.


Each process then initiates the send operations. In this simple example, each process sends boundary patch info to the corresponding neighboring process.

Neighboring, in this context, means the process on the other end of the processor patch.


On line nine, we wait for the send operations attached to the parallel buffers object to finish. Note that none of the previous operations block, so all processes continue to the receiving part, where each process reads the patch info into a local variable.


Also, note that any computations placed after the receive calls that are not related to the communication will be executed while the communication is still ongoing.
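Since the snippet itself is not reproduced in this transcript, here is a hedged reconstruction of the pattern being described, using OpenFOAM's PstreamBuffers, UOPstream, and UIPstream classes. The layout and line numbers of the slide code will differ, the payload (patch face centres) is just an illustrative choice, mesh is assumed to be a polyMesh or fvMesh already in scope, and newer OpenFOAM versions spell the comms mode Pstream::commsTypes::nonBlocking.

    #include "PstreamBuffers.H"
    #include "processorPolyPatch.H"

    // Create a local parallel buffers object, one per process, in non-blocking mode
    PstreamBuffers pBufs(Pstream::nonBlocking);

    // Initiate the sends: each processor patch streams its data to the
    // neighboring process on the other end of the patch
    forAll(mesh.boundaryMesh(), patchi)
    {
        const polyPatch& pp = mesh.boundaryMesh()[patchi];

        if (isA<processorPolyPatch>(pp))
        {
            const processorPolyPatch& procPatch =
                refCast<const processorPolyPatch>(pp);

            UOPstream toNbr(procPatch.neighbProcNo(), pBufs);
            toNbr << pp.faceCentres();    // illustrative payload
        }
    }

    // Wait for the send operations attached to pBufs to finish
    pBufs.finishedSends();

    // Receive: read the neighbor's data into a local variable
    forAll(mesh.boundaryMesh(), patchi)
    {
        const polyPatch& pp = mesh.boundaryMesh()[patchi];

        if (isA<processorPolyPatch>(pp))
        {
            const processorPolyPatch& procPatch =
                refCast<const processorPolyPatch>(pp);

            UIPstream fromNbr(procPatch.neighbProcNo(), pBufs);
            vectorField nbrCentres(fromNbr);

            // ... unrelated computations can be interleaved here ...
        }
    }

Note that finishedSends() is the only blocking point in this sketch: the individual sends and receives never need to be matched in a particular order across processes, which is what removes the deadlock risk discussed earlier.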

This type of communication simplifies your code considerably; just think of the effort needed to get the operation order right if we were to use blocking comms.

That said, starting with receives on all processes would most likely cause no issues with most MPI implementations in this particular case.


But again, even that is hard to test, as the number of processes is a dynamic property, and the number of communication calls also depends on the specific mesh and decomposition method used.


Efficiency-wise, overlapping communication and computation is certainly beneficial. I haven't tested messages under 8 MB because MPI implementations are somewhat eager and might switch to non-blocking behavior internally even when the blocking functions are called for small messages. But for large messages we can get up to a 50% improvement compared to blocking comms.


Here, the orange region denotes time spent on unrelated computation, just dummy operations on large arrays.

For the blocking calls, the remaining time is the actual communication time. And, as you can see, a big chunk of this communication time is really just processes waiting around: with non-blocking comms, where the waiting time is minimal, the total time is much shorter.


Now, if you have experience with MPI calls, it might be useful to know how OpenFOAM comms modes relate to MPI calls, at least for the sending operations.

So let me quickly remind you again (a sketch of the corresponding MPI calls follows this list):

  • The scheduled comms do a standard MPI send (MPI_Send), which might buffer the message or perform a synchronous send.
  • The blocking comms do a buffered MPI send (MPI_Bsend), which copies your data and returns, but still expects a receive to show up.
  • The non-blocking comms do an immediate MPI send (MPI_Isend), which only initiates the transfer and returns.
  • There are also two more send modes, ready (MPI_Rsend) and synchronous (MPI_Ssend), which are not used in OpenFOAM, well, at least not explicitly; MPI can fall back to these on its own if it deems it necessary.
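As a rough illustration of that mapping, here is a small two-rank MPI program that issues the three send variants OpenFOAM calls explicitly: a standard send, a buffered send, and an immediate send. The buffer size and tags are arbitrary, and this is a sketch of the MPI calls involved rather than of OpenFOAM's own Pstream implementation (run with at least 2 ranks).

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 8;
        std::vector<double> data(n, 1.0);
        std::vector<double> recv(n);

        if (rank == 0)
        {
            // "scheduled": standard send -- MPI may buffer or synchronize
            MPI_Send(data.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

            // "blocking": buffered send -- copies into an attached buffer and returns
            int bufSize = n*sizeof(double) + MPI_BSEND_OVERHEAD;
            std::vector<char> bsendBuf(bufSize);
            MPI_Buffer_attach(bsendBuf.data(), bufSize);
            MPI_Bsend(data.data(), n, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
            void* detached; int detachedSize;
            MPI_Buffer_detach(&detached, &detachedSize);

            // "nonBlocking": immediate send -- returns a request to wait on later
            MPI_Request req;
            MPI_Isend(data.data(), n, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            // MPI_Rsend (ready) and MPI_Ssend (synchronous) also exist,
            // but OpenFOAM does not call them explicitly.
        }
        else if (rank == 1)
        {
            for (int tag = 0; tag < 3; ++tag)
            {
                MPI_Recv(recv.data(), n, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }

        MPI_Finalize();
        return 0;
    }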

Downloads

⇩ Lecture slides