Aboria can use OpenMP or CUDA to take advantage of multiple CPU cores or an Nvidia GPU. In general, Aboria relies on the parallel algorithms and vectors provided by Thrust, so you can select either Thrust's OpenMP or CUDA backend to get the type of parallelism you want. However, a few parts of Aboria are OpenMP only (notably the entirety of the symbolic and kernel APIs).
You don't have to do much to start using OpenMP with the high-level symbolic or kernel interfaces: all that is required is that you install OpenMP and Thrust and add the HAVE_THRUST compiler definition (see aboria.installation_and_getting_started). For lower-level programming you will need to add a few pragmas to your code; a few examples are discussed below.
First, let's create a set of N
particles as usual
```cpp
const size_t N = 100;
ABORIA_VARIABLE(neighbour_count, int, "neighbour_count");
typedef Particles<std::tuple<neighbour_count>, 2> particle_t;
typedef particle_t::position position;
particle_t particles(N);
```
Now we will loop through the particles and set their initial positions randomly. In order to use an OpenMP parallel for loop, we will stick to a simple index-based for loop to iterate through the particles, like so
```cpp
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  std::uniform_real_distribution<double> uniform(0, 1);
  auto gen = get<generator>(particles)[i];
  get<position>(particles)[i] = vdouble2(uniform(gen), uniform(gen));
}
```
Now we can initialise the neighbourhood search data structure. Note that creation of, and updates to, the spatial data structures are run in parallel.
> **Note:** currently the only data structure that is created or updated in serial is …
```cpp
particles.init_neighbour_search(vdouble2::Constant(0), vdouble2::Constant(1),
                                vbool2::Constant(false));
```
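If the particles later move, the spatial data structure needs to be told about the new positions. The following is a minimal sketch of that step, using Aboria's update_positions() member function (the movement loop here is only illustrative and is not part of the original example):

```cpp
// Move each particle halfway towards the domain centre (so positions stay
// inside the [0,1] x [0,1] domain), then update the spatial data structure so
// that subsequent neighbour searches see the new positions
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  get<position>(particles)[i] =
      0.5 * (get<position>(particles)[i] + vdouble2(0.5, 0.5));
}
particles.update_positions(); // the update itself runs in parallel
```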
We will use Aboria's range search to look for neighbouring pairs within a cutoff, and once again use OpenMP's parallel loop. All queries to the spatial data structures are thread-safe and can be used in parallel.
```cpp
const double radius = 0.1;
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  for (auto j = euclidean_search(particles.get_query(),
                                 get<position>(particles)[i], radius);
       j != false; ++j) {
    ++get<neighbour_count>(particles)[i];
  }
}
```
In general, that is 90% of what you need to know: just add a couple of OpenMP pragmas to your loops and you are ready to go!
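The same pattern extends to reductions as well. As a sketch (not part of the original example), the neighbour counts computed above could be summed with an OpenMP reduction clause:

```cpp
// Sum the neighbour counts computed above using an OpenMP reduction
int total_neighbours = 0;
#pragma omp parallel for reduction(+ : total_neighbours)
for (size_t i = 0; i < particles.size(); ++i) {
  total_neighbours += get<neighbour_count>(particles)[i];
}
```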
> **Caution:** CUDA support in Aboria is experimental and is not tested regularly. We welcome feedback from any CUDA users if anything doesn't work for you.
Writing CUDA-compatible code is slightly more involved. Aboria uses the Thrust library for CUDA parallelism, and follows similar (i.e. STL-like) patterns.
Most importantly, we need to make sure that all the particle data is contained in vectors that are stored on the GPU. To do this we use a thrust::device_vector as the base storage vector for our particles class:
```cpp
ABORIA_VARIABLE(thrust_neighbour_count, int, "thrust_neighbour_count");

int main() {
  typedef Particles<std::tuple<thrust_neighbour_count>, 2,
                    thrust::device_vector, CellListOrdered>
      thrust_particle_t;
  thrust_particle_t thrust_particles(N);
```
Since all our data is on the device, we cannot use raw for loops to access
this data without copying it back to the host, an expensive operation. Instead,
Thrust provides a wide variety of parallel algorithms to manipulate the data.
Aboria's Aboria::zip_iterator is compatible with the Thrust framework, so it can be used in a similar fashion to Thrust's own zip_iterator (except that, unlike Thrust's zip_iterator, we can take advantage of Aboria's tagged reference and value types).
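For instance (a minimal sketch, ahead of the full examples below), a Thrust algorithm over the particle range can set a variable by its tag rather than by its position in a tuple:

```cpp
// Zero the neighbour-count variable on the device; `i` is a tagged reference,
// so we can write get<thrust_neighbour_count>(i) instead of thrust::get<N>(i)
thrust::for_each(thrust_particles.begin(), thrust_particles.end(),
                 [] __device__(auto i) { get<thrust_neighbour_count>(i) = 0; });
```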
We can use Thrust's tabulate
algorithm to loop through the particles and set their initial positions randomly.
```cpp
thrust::tabulate(get<position>(thrust_particles).begin(),
                 get<position>(thrust_particles).end(),
                 [] __device__(const int i) {
                   thrust::default_random_engine gen;
                   thrust::uniform_real_distribution<float> uni(0, 1);
                   gen.discard(i);
                   return vdouble2(uni(gen), uni(gen));
                 });
```
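If you want to inspect the results on the host (this is not part of the original example), one option is to copy a variable's range into a thrust::host_vector, which performs the device-to-host transfer for you (requires the thrust/host_vector.h header):

```cpp
// Copy the device positions back to the host to check them (a device-to-host
// transfer, so best kept to debugging rather than inner loops)
thrust::host_vector<vdouble2> host_positions(
    get<position>(thrust_particles).begin(),
    get<position>(thrust_particles).end());
std::cout << "first particle is at (" << host_positions[0][0] << ", "
          << host_positions[0][1] << ")" << std::endl;
```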
Now we can initialise the neighbourhood search data structure. Note that we are using the Aboria::CellListOrdered data structure, which is similar to Aboria::CellList but instead relies on reordering the particles to arrange them into cells, which is more amenable to parallelisation on a GPU.
```cpp
thrust_particles.init_neighbour_search(vdouble2::Constant(0),
                                       vdouble2::Constant(1),
                                       vbool2::Constant(false));
```
We can use any of Aboria's range searches within a Thrust algorithm. Below we will implement a range search around each particle, counting all neighbours within range. Note that we need to copy all of the variables we need from the outer scope into the lambda capture, since the lambda will run on the device and won't be able to access any host memory.
```cpp
thrust::for_each(
    thrust_particles.begin(), thrust_particles.end(),
    [radius, query = thrust_particles.get_query()] __device__(auto i) {
      get<thrust_neighbour_count>(i) = 0;
      for (auto j = euclidean_search(query, get<position>(i), radius);
           j != false; ++j) {
        ++get<thrust_neighbour_count>(i);
      }
    });
```
While we have mainly used thrust::for_each above, the iterators that Aboria provides for the Aboria::Particles container should work with all of Thrust's algorithms. For example, you might wish to restructure the previous code as a transform:
```cpp
thrust::transform(
    thrust_particles.begin(), thrust_particles.end(),
    get<thrust_neighbour_count>(thrust_particles).begin(),
    [radius, query = thrust_particles.get_query()] __device__(auto i) {
      int sum = 0;
      for (auto j = euclidean_search(query, get<position>(i), radius);
           j != false; ++j) {
        ++sum;
      }
      return sum;
    });
}
```
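Other Thrust algorithms work on the per-variable ranges in the same way. As a final sketch (not from the original example), here is a device-side reduction over the neighbour counts computed above:

```cpp
// (inside main(), after the transform above) sum the neighbour counts on the
// device with a Thrust reduction, then compute the average on the host
const int total =
    thrust::reduce(get<thrust_neighbour_count>(thrust_particles).begin(),
                   get<thrust_neighbour_count>(thrust_particles).end(), 0);
const double average = static_cast<double>(total) / thrust_particles.size();
```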