We have been asked by our ECP level 2 manager to support OpenMP in VTK-m. The rationale is that many ECP applications use OpenMP, and there is concern that our use of TBB will interfere with code that uses OpenMP. (This concern has a basis: we have already encountered such interaction problems.)
In principle, adding OpenMP to VTK-m is straightforward: we simply need to create a new device adapter that uses OpenMP. However, there are multiple approaches we could take for the implementation. This document captures the design decisions for an OpenMP device adapter.
Option 1: Thrust
The Thrust parallel algorithms library provides an OpenMP backend. Thus, we can leverage its parallel algorithms.
Pros:
- Low barrier of entry to parallel algorithms created by a larger community.
- We already have a Thrust interface for CUDA and should be able to adapt it for OpenMP.
Cons:
- Our Thrust code might not be general enough to handle backends other than CUDA, so substantial refactoring might be needed.
- In the past, we have found OpenMP with Thrust slower than TBB. The following (not mutually exclusive) items might have been factors:
  - The primary developers of Thrust design for CUDA, so other backends may not have received as much optimization.
  - The parallel for operation breaks scheduling into single units instead of chunks, which is likely less efficient for CPU processing.
- Adds an extra library dependency. (Thrust comes with CUDA, but it would have to be either installed separately or brought in as a TPL for builds without CUDA.)
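The scheduling concern in the list above can be pictured with plain OpenMP pragmas. The following is a minimal sketch and not Thrust's actual implementation: the function names `SquareSingleUnits` and `SquareChunked` are hypothetical, and the `schedule` clauses are only an approximation of the two strategies. When compiled without OpenMP enabled, the pragmas are ignored and both functions run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squares each input value, handing iterations to
// threads one at a time in round-robin order. Adjacent elements end up on
// different threads, which tends to hurt cache behavior on CPUs.
inline void SquareSingleUnits(const std::vector<double>& in, std::vector<double>& out)
{
  assert(in.size() == out.size());
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(in.size());
#pragma omp parallel for schedule(static, 1)
  for (std::ptrdiff_t i = 0; i < n; ++i)
  {
    const std::size_t idx = static_cast<std::size_t>(i);
    out[idx] = in[idx] * in[idx];
  }
}

// Hypothetical helper: same computation, but each thread receives one
// contiguous chunk of roughly n / numThreads iterations.
inline void SquareChunked(const std::vector<double>& in, std::vector<double>& out)
{
  assert(in.size() == out.size());
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(in.size());
#pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0; i < n; ++i)
  {
    const std::size_t idx = static_cast<std::size_t>(i);
    out[idx] = in[idx] * in[idx];
  }
}
```

Both functions produce identical results; only the mapping of iterations to threads differs. Any performance difference would need to be measured on the target machine.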
Option 2: Hand Code
Rather than rely on any third-party implementation of OpenMP parallel algorithms, we would develop directly using OpenMP pragmas. We already have generic implementations of all device adapter algorithms, so a first pass could be implemented quickly.
Pros:
- No dependence on external libraries and no TPL added to the source code.
- Optimizations can be contributed to the generic algorithm layer, which can help support future device adapters.
Cons:
- Cannot take advantage of a larger community of parallel algorithm developers.
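To give a sense of what hand coding with OpenMP pragmas might look like, here is a minimal sketch of two basic operations. It does not use VTK-m headers or the real device adapter interface: `ScheduleSketch` and `ReduceSumSketch` are hypothetical names chosen for illustration, and the functor is assumed to be safe to call concurrently from several threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a "schedule" operation: calls functor(index)
// once for every index in [0, numInstances), split across threads.
template <typename Functor>
void ScheduleSketch(const Functor& functor, std::ptrdiff_t numInstances)
{
  assert(numInstances >= 0);
#pragma omp parallel for schedule(static)
  for (std::ptrdiff_t index = 0; index < numInstances; ++index)
  {
    functor(index);
  }
}

// Hypothetical stand-in for a "reduce" operation specialized to summation.
// The reduction clause gives each thread a private partial sum and combines
// the partial sums when the loop finishes.
inline double ReduceSumSketch(const std::vector<double>& values, double initial)
{
  double sum = 0.0;
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
#pragma omp parallel for reduction(+ : sum)
  for (std::ptrdiff_t i = 0; i < n; ++i)
  {
    sum += values[static_cast<std::size_t>(i)];
  }
  return initial + sum;
}
```

A caller could fill an array with `ScheduleSketch([p](std::ptrdiff_t i) { p[i] = 1.0; }, n)` and then sum it with `ReduceSumSketch`. Without OpenMP enabled, the pragmas are ignored and the same code runs serially.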
Option 3: Some Other TPL
There might be some other TPL designed for OpenMP that we could either link to or absorb. That said, until we find such a library, it is hard to reason about this option.
Notes from discussion with Rob Maynard
Rob suggests we go with Option 2. The main problem with Option 1 is that Thrust's OpenMP implementation is fairly weak; even a simple hand-coded implementation should perform about as well.
Rob also suggests requiring OpenMP 4, which comes with several useful features that we would like to take advantage of. Although OpenMP 4 support is poor on Windows and Mac, our primary use case is ECP, and we should be able to get a suitable compiler on Linux/Unix.
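As a rough illustration of what depending on OpenMP 4 could look like in code, the sketch below checks the standard `_OPENMP` macro, which OpenMP-enabled compilers define as the yyyymm date of the supported specification (201307 for OpenMP 4.0). The notes above do not say which OpenMP 4 features are wanted, so the `simd` construct is used here purely as an example of one. The names `SKETCH_HAS_OPENMP4`, `HasOpenMP4`, and `DotProductSketch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// _OPENMP is only defined when OpenMP is enabled; 201307 is OpenMP 4.0.
#if defined(_OPENMP) && (_OPENMP >= 201307)
#define SKETCH_HAS_OPENMP4 1
#else
#define SKETCH_HAS_OPENMP4 0
#endif

// Hypothetical query: reports whether this translation unit was compiled
// with OpenMP 4.0 or newer.
inline bool HasOpenMP4()
{
  return SKETCH_HAS_OPENMP4 != 0;
}

// Hypothetical example of an OpenMP 4 feature: the combined
// "parallel for simd" construct splits the loop across threads and asks
// the compiler to vectorize each thread's share.
inline double DotProductSketch(const std::vector<double>& a, const std::vector<double>& b)
{
  assert(a.size() == b.size());
  double sum = 0.0;
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
#if SKETCH_HAS_OPENMP4
#pragma omp parallel for simd reduction(+ : sum)
#endif
  for (std::ptrdiff_t i = 0; i < n; ++i)
  {
    const std::size_t idx = static_cast<std::size_t>(i);
    sum += a[idx] * b[idx];
  }
  return sum;
}
```

This sketch falls back to a serial loop when OpenMP 4 is unavailable. A hard requirement, as suggested above, could instead stop the build with an `#error` directive in the `#else` branch.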
Kitware will take the lead on writing the implementation. Rob estimates that it will take 0.25 FTE. We propose dropping the Surface Metrics deliverable, which has a similar FTE estimate, and replacing it with an OpenMP deliverable. We also need to compress our work schedule, so this is a good candidate for that compression.