Benchmarks

Vector Addition

Here we aim to compute a simple vector addition operation

$$ a_i = b_i + c_i \text{ for } i = 0...N. $$

A particle set containing the variables $a$, $b$ and $c$ can be defined in Aboria like so

ABORIA_VARIABLE(a_var,double,"a")
ABORIA_VARIABLE(b_var,double,"b")
ABORIA_VARIABLE(c_var,double,"c")
typedef Particles<std::tuple<a_var,b_var,c_var>,3> nodes_type;
nodes_type nodes(N);

The vector addition operation can then be calculated using the Level 3 layer like so

Symbol<a_var> a;
Symbol<b_var> b;
Symbol<c_var> c;
Label<0,nodes_type> i(nodes);
a[i] = b[i] + c[i];

We compare this with Level 1 Aboria using the get functions and looping through the container

for (size_t i=0; i<N; i++) {
    get<a_var>(nodes)[i] = get<b_var>(nodes)[i] + get<c_var>(nodes)[i];
}

We also compare against a plain std::vector implementation like so

std::vector<double> a(N),b(N),c(N);
for (size_t i=0; i<N; i++) {
    a[i] = b[i] + c[i];
}

Finally we compare against an Eigen implementation:

typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_type;
vector_type a(N),b(N),c(N);
a = b + c;

We can measure the time taken by the last line in the code segment above for varying $N$, and compare the four different implementations

The resultant benchmarks are shown in the Figure below, where it can be seen that the four approaches are very similar in speed, confirming that [Aboria][] can achieve zero-cost abstraction, at least in this simple case. More complicated cases are explored below.

DAXPY

This benchmark is for the BLAS DAXPY operation, given by

$$ a_i = a_i + 0.1*b_i \text{ for } i = 0...N. $$

This is implemented in Aboria using

a[i] += 0.1*b[i];

We compare against a Level 1 implementation like so

for (size_t i=0; i<N; i++) {
    get<a_var>(nodes)[i] += 0.1*get<b_var>(nodes)[i];
}

and a std::vector implementation like so

std::vector<double> a(N),b(N);
for (size_t i=0; i<N; i++) {
    a[i] += 0.1*b[i];
}

and an Eigen implementation

typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_type;
vector_type a(N),b(N);
a += 0.1*b;

The comarison benchmarks for varying $N$ are shown below

Dense non-linear operator

Here we move onto a dense, $N^2$ operation, given by the non-linear operator

$$ a_i = a_i + \sum_j^N a_j \sqrt{\mathbf{dx}_{ij} \cdot \mathbf{dx}_{ij} + b_j^2} \text{ for } i = 0...N. $$

where $\mathbf{dx}_{ij}$ is the shortest vector from particle $i$ to $i$. This is implemented in Level 3 Aboria like so

Label<1,nodes_type> j(nodes);
auto dx = create_dx(i,j);
Accumulate<std::plus<double> > sum;
a[i] += sum(j,true,a[j]*sqrt(dot(dx,dx)+b[j]*b[j]));

This is compared against a std::vector implementation. Note that this operator involves aliasing, in that the update variable $a$ appears within the sum, so we need to accumulate the update to a temporary buffer before we assign to $a_i$.

The implementation is shown below (note the openMP parallel loops are turned off for the plot below)

std::vector<double> x(N), y(N), b(N), a(N), a_buffer(N);

#pragma omp parallel for
for (size_t i = 0; i < N; ++i) {
    a_buffer[i] = a[i];
    for (size_t j = 0; j < N; ++j) {
        const double dx_x = x[j]-x[i];
        const double dx_y = y[j]-x[i];
        a_buffer[i] += a[j]*std::sqrt(dx_x*dx_x+dx_y*dx_y+b[j]*b[j]);
    }
}
#pragma omp parallel for
for (size_t i = 0; i < N; ++i) {
    a[i] = a_buffer[i];
}

The benchmarks are shown below.

Neighbour search non-linear operator

Finally we implement a non-linear operator involving a neighbour search, common in particle-based methods. This is given by

$$ a_i = \sum_j^N \begin{cases} \frac{r-|\mathbf{dx}_{ij}|}{|\mathbf{dx}_{ij}|}\mathbf{dx}_{ij} , & \text{for } |\mathbf{dx}_{ij}|<r \\ 0 & \text{otherwise}, \end{cases} \text{ for } i = 0...N. $$

where $r$ is a given constant.

a[i] = sum(j,norm(dx)<r,(r-norm(dx))/norm(dx)*dx);

The benchmarks are shown below.