Aboria can use OpenMP or CUDA to take advantage of multiple CPU cores or an Nvidia GPU. In general, Aboria relies on the parallel algorithms and vectors provided by Thrust, so you can select either Thrust's OpenMP or CUDA backend to get the type of parallelism you want. However, a few parts of Aboria are OpenMP only (notably the entirety of the symbolic and kernel APIs).
You don't have to do much to start using OpenMP with the high-level symbolic or kernel interfaces: all that is required is that you install OpenMP and Thrust and add the HAVE_THRUST compiler definition (see aboria.installation_and_getting_started). For lower-level programming you will need to add a few pragmas to your code; a few examples are discussed below.
First, let's create a set of N
particles as usual
```cpp
const size_t N = 100;
ABORIA_VARIABLE(neighbour_count, int, "neighbour_count");
typedef Particles<std::tuple<neighbour_count>, 2> particle_t;
typedef particle_t::position position;
particle_t particles(N);
```
Now we will loop through the particles and set their initial positions randomly. In order to use an OpenMP parallel for loop, we will stick to a simple index-based for loop to iterate through the particles, like so
```cpp
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  std::uniform_real_distribution<double> uniform(0, 1);
  auto gen = get<generator>(particles)[i];
  get<position>(particles)[i] = vdouble2(uniform(gen), uniform(gen));
}
```
Now we can initialise the neighbourhood search data structure. Note that creation of, and updates to, the spatial data structures are run in parallel.
> **Note:** currently the only data structure that is created or updated in serial is …
```cpp
particles.init_neighbour_search(vdouble2::Constant(0), vdouble2::Constant(1),
                                vbool2::Constant(false));
```
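If the particles later move, the spatial data structure needs to be told about the new positions. The following is a minimal sketch of that step, using Aboria's update_positions() member function (the movement loop here is only illustrative and is not part of the original example):

```cpp
// Move each particle halfway towards the domain centre (so positions stay
// inside the [0,1] x [0,1] domain), then update the spatial data structure so
// that subsequent neighbour searches see the new positions
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  get<position>(particles)[i] =
      0.5 * (get<position>(particles)[i] + vdouble2(0.5, 0.5));
}
particles.update_positions(); // the update itself runs in parallel
```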
We will use Aboria's range search to look for neighbouring pairs within a cutoff, and once again use OpenMP's parallel loop. All queries to the spatial data structures are thread-safe and can be used in parallel.
```cpp
const double radius = 0.1;
#pragma omp parallel for
for (size_t i = 0; i < particles.size(); ++i) {
  for (auto j = euclidean_search(particles.get_query(),
                                 get<position>(particles)[i], radius);
       j != false; ++j) {
    ++get<neighbour_count>(particles)[i];
  }
}
```
In general, that is 90% of what you need to know: just add a couple of OpenMP pragmas to your loops and you are ready to go!
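The same pattern extends to reductions as well. As a sketch (not part of the original example), the neighbour counts computed above could be summed with an OpenMP reduction clause:

```cpp
// Sum the neighbour counts computed above using an OpenMP reduction
int total_neighbours = 0;
#pragma omp parallel for reduction(+ : total_neighbours)
for (size_t i = 0; i < particles.size(); ++i) {
  total_neighbours += get<neighbour_count>(particles)[i];
}
```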
> **Caution:** CUDA support in Aboria is experimental and is not tested regularly. We welcome feedback from any CUDA users if anything doesn't work for you.
Writing CUDA-compatible code is slightly more involved. Aboria uses the Thrust library for CUDA parallelism, and follows similar (i.e. STL-like) patterns.
Most importantly, we need to make sure that all the particle data is contained in vectors that are stored on the GPU. To do this we use a thrust::device_vector as the base storage vector for our particles class:
```cpp
ABORIA_VARIABLE(thrust_neighbour_count, int, "thrust_neighbour_count");

int main() {
  typedef Particles<std::tuple<thrust_neighbour_count>, 2,
                    thrust::device_vector, CellListOrdered>
      thrust_particle_t;
  thrust_particle_t thrust_particles(N);
```
Since all our data is on the device, we cannot use raw for loops to access
this data without copying it back to the host, an expensive operation. Instead,
Thrust provides a wide variety of parallel algorithms to manipulate the data.
Aboria's Aboria::zip_iterator is compatible with the Thrust framework, so it can be used in a similar fashion to Thrust's own zip_iterator (except that, unlike Thrust's zip_iterator, we can take advantage of Aboria's tagged reference and value types).
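For instance (a minimal sketch, ahead of the full examples below), a Thrust algorithm over the particle range can set a variable by its tag rather than by its position in a tuple:

```cpp
// Zero the neighbour-count variable on the device; `i` is a tagged reference,
// so we can write get<thrust_neighbour_count>(i) instead of thrust::get<N>(i)
thrust::for_each(thrust_particles.begin(), thrust_particles.end(),
                 [] __device__(auto i) { get<thrust_neighbour_count>(i) = 0; });
```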
We can use Thrust's tabulate
algorithm to loop through the particles and set their initial positions randomly.
```cpp
thrust::tabulate(get<position>(thrust_particles).begin(),
                 get<position>(thrust_particles).end(),
                 [] __device__(const int i) {
                   thrust::default_random_engine gen;
                   thrust::uniform_real_distribution<float> uni(0, 1);
                   gen.discard(i);
                   return vdouble2(uni(gen), uni(gen));
                 });
```
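If you want to inspect the results on the host (this is not part of the original example), one option is to copy a variable's range into a thrust::host_vector, which performs the device-to-host transfer for you (requires the thrust/host_vector.h header):

```cpp
// Copy the device positions back to the host to check them (a device-to-host
// transfer, so best kept to debugging rather than inner loops)
thrust::host_vector<vdouble2> host_positions(
    get<position>(thrust_particles).begin(),
    get<position>(thrust_particles).end());
std::cout << "first particle is at (" << host_positions[0][0] << ", "
          << host_positions[0][1] << ")" << std::endl;
```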
Now we can initialise the neighbourhood search data structure. Note that we are using the Aboria::CellListOrdered data structure, which is similar to Aboria::CellList but instead relies on reordering the particles to arrange them into cells, which is more amenable to parallelisation on a GPU.
```cpp
thrust_particles.init_neighbour_search(vdouble2::Constant(0),
                                       vdouble2::Constant(1),
                                       vbool2::Constant(false));
```
We can use any of Aboria's range searches within a Thrust algorithm. Below we will implement a range search around each particle, counting all neighbours within range. Note that we need to copy all of the variables we need from the outer scope into the lambda capture, since the lambda will run on the device and won't be able to access any host memory.
```cpp
thrust::for_each(
    thrust_particles.begin(), thrust_particles.end(),
    [radius, query = thrust_particles.get_query()] __device__(auto i) {
      get<thrust_neighbour_count>(i) = 0;
      for (auto j = euclidean_search(query, get<position>(i), radius);
           j != false; ++j) {
        ++get<thrust_neighbour_count>(i);
      }
    });
```
While we have mainly used thrust::for_each above, the iterators that Aboria provides for the Aboria::Particles container should work with all of Thrust's algorithms. For example, you might wish to restructure the previous code as a transform:
```cpp
thrust::transform(
    thrust_particles.begin(), thrust_particles.end(),
    get<thrust_neighbour_count>(thrust_particles).begin(),
    [radius, query = thrust_particles.get_query()] __device__(auto i) {
      int sum = 0;
      for (auto j = euclidean_search(query, get<position>(i), radius);
           j != false; ++j) {
        ++sum;
      }
      return sum;
    });
}
```
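Other Thrust algorithms work on the per-variable ranges in the same way. As a final sketch (not from the original example), here is a device-side reduction over the neighbour counts computed above:

```cpp
// (inside main(), after the transform above) sum the neighbour counts on the
// device with a Thrust reduction, then compute the average on the host
const int total =
    thrust::reduce(get<thrust_neighbour_count>(thrust_particles).begin(),
                   get<thrust_neighbour_count>(thrust_particles).end(), 0);
const double average = static_cast<double>(total) / thrust_particles.size();
```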