Yesterday I learned how to speed up the vector operation y := y + αx by using SSE2 instructions.
The function I was trying to speed up was as follows:
static void
daxpy0(int n, double a, const double *x, double *y)
{
int i;
for (i=0; i<n; ++i) {
y[i] += a * x[i];
}
}
Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:
#include <emmintrin.h>
static void
daxpy1(int n, double a, const double *x, double *y)
{
__m128d aa, xx, yy;
int i;
aa = _mm_set1_pd(a); /* store a in both registers */
for (i=0; i+1<n; i+=2) {
xx = _mm_load_pd(x+i); /* load two elements from x ... */
xx = _mm_mul_pd(aa, xx); /* ... and multiply both by a */
yy = _mm_load_pd(y+i); /* load two elements from y ... */
yy = _mm_add_pd(yy, xx); /* ... and add */
_mm_store_pd(y+i, yy) /* write back both elements of y */
}
if (i < n) {
y[i] += a * x[i]; /* handle last element if n is odd */
}
}
This code required that the vectors x and y
are 16-byte aligned (otherwise a segmentation fault will occur). This
assumption holds, for example, for memory blocks allocated by
malloc on 64bit Linux and MacOS X systems. Also,
obviously, this only works on CPUs which support the SSE2 instruction set.
A description of the SSE2 instructions used here can be found in the
Intrinsics Reference
of the
Intel C++ Compiler Documentation
(also seems to apply to the GNU C compiler; direct link
here).
For comparison, I also tried to use the daxpy function
provided by BLAS:
static void
daxpy2(int n, double a, const double *x, double *y)
{
extern void daxpy_(int *Np, double *DAp,
const double *X, int *INCXp,
double *Y, int *INCYp);
int inc = 1;
daxpy_(&n, &a, x, &inc, y, &inc);
}
Details about this approach are described on my page about linear algebra packages on Linux.
Results. I tried the three functions using two
vectors of length 2000. The following table gives the execution time for a
million calls to daxpy (on a newish quad-core Linux machine):
| method | function | time [s] | direct | daxpy0
| 1.52 |
|---|---|---|
| SSE2 | daxpy1
| 0.91 |
| BLAS | daxpy2
| 2.47 |
As can be seen, the SSE2 code in daxpy1 is fastest,
compared to the naive implementation daxpy0 it takes 40% off
the execution time! For some reason, the BLAS version seems to be very
slow; and the results on my MacOS machine are similar. Currently I have no
explanation for this effect.