Here we aim to compute a simple vector addition operation
$$ a_i = b_i + c_i \text{ for } i = 0...N. $$
A particle set containing the variables $a$, $b$ and $c$ can be defined in Aboria like so
ABORIA_VARIABLE(a_var,double,"a") ABORIA_VARIABLE(b_var,double,"b") ABORIA_VARIABLE(c_var,double,"c") typedef Particles<std::tuple<a_var,b_var,c_var>,3> nodes_type; nodes_type nodes(N);
The vector addition operation can then be calculated using the Level 3 layer like so
Symbol<a_var> a; Symbol<b_var> b; Symbol<c_var> c; Label<0,nodes_type> i(nodes); a[i] = b[i] + c[i];
We compare this with Level 1 Aboria using the get
functions and looping through the container
for (size_t i=0; i<N; i++) { get<a_var>(nodes)[i] = get<b_var>(nodes)[i] + get<c_var>(nodes)[i]; }
We also compare against a plain std::vector
implementation like so
std::vector<double> a(N),b(N),c(N); for (size_t i=0; i<N; i++) { a[i] = b[i] + c[i]; }
Finally we compare against an Eigen implementation:
typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_type; vector_type a(N),b(N),c(N); a = b + c;
We can measure the time taken by the last line in the code segment above for varying $N$, and compare the four different implementations
The resultant benchmarks are shown in the Figure below, where it can be seen that the four approaches are very similar in speed, confirming that [Aboria][] can achieve zero-cost abstraction, at least in this simple case. More complicated cases are explored below.
This benchmark is for the BLAS DAXPY operation, given by
$$ a_i = a_i + 0.1*b_i \text{ for } i = 0...N. $$
This is implemented in Aboria using
a[i] += 0.1*b[i];
We compare against a Level 1 implementation like so
for (size_t i=0; i<N; i++) { get<a_var>(nodes)[i] += 0.1*get<b_var>(nodes)[i]; }
and a std::vector
implementation like so
std::vector<double> a(N),b(N); for (size_t i=0; i<N; i++) { a[i] += 0.1*b[i]; }
and an Eigen implementation
typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_type; vector_type a(N),b(N); a += 0.1*b;
The comarison benchmarks for varying $N$ are shown below
Here we move onto a dense, $N^2$ operation, given by the non-linear operator
$$ a_i = a_i + \sum_j^N a_j \sqrt{\mathbf{dx}_{ij} \cdot \mathbf{dx}_{ij} + b_j^2} \text{ for } i = 0...N. $$
where $\mathbf{dx}_{ij}$ is the shortest vector from particle $i$ to $i$. This is implemented in Level 3 Aboria like so
Label<1,nodes_type> j(nodes); auto dx = create_dx(i,j); Accumulate<std::plus<double> > sum; a[i] += sum(j,true,a[j]*sqrt(dot(dx,dx)+b[j]*b[j]));
This is compared against a std::vector
implementation. Note that this operator involves aliasing, in that the update
variable $a$ appears within the sum, so we need to accumulate the update
to a temporary buffer before we assign to $a_i$.
The implementation is shown below (note the openMP parallel loops are turned off for the plot below)
std::vector<double> x(N), y(N), b(N), a(N), a_buffer(N); #pragma omp parallel for for (size_t i = 0; i < N; ++i) { a_buffer[i] = a[i]; for (size_t j = 0; j < N; ++j) { const double dx_x = x[j]-x[i]; const double dx_y = y[j]-x[i]; a_buffer[i] += a[j]*std::sqrt(dx_x*dx_x+dx_y*dx_y+b[j]*b[j]); } } #pragma omp parallel for for (size_t i = 0; i < N; ++i) { a[i] = a_buffer[i]; }
The benchmarks are shown below.
Finally we implement a non-linear operator involving a neighbour search, common in particle-based methods. This is given by
$$ a_i = \sum_j^N \begin{cases} \frac{r-|\mathbf{dx}_{ij}|}{|\mathbf{dx}_{ij}|}\mathbf{dx}_{ij} , & \text{for } |\mathbf{dx}_{ij}|<r \\ 0 & \text{otherwise}, \end{cases} \text{ for } i = 0...N. $$
where $r$ is a given constant.
a[i] = sum(j,norm(dx)<r,(r-norm(dx))/norm(dx)*dx);
The benchmarks are shown below.