fast vector operations using SSE2

By Jochen Voss, on 2010-02-03

Yesterday I learned how to speed up the vector operation y := y + αx by using SSE2 instructions.

The function I was trying to speed up was as follows:

static void
daxpy0(int n, double a, const double *x, double *y)
{
  int i;
  for (i=0; i<n; ++i) {
    y[i] += a * x[i];
  }
}

Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:

#include <emmintrin.h>

static void
daxpy1(int n, double a, const double *x, double *y)
{
  __m128d aa, xx, yy;
  int i;

  aa = _mm_set1_pd(a);       /* store a in both registers */
  for (i=0; i+1<n; i+=2) {
    xx = _mm_load_pd(x+i);   /* load two elements from x ... */
    xx = _mm_mul_pd(aa, xx); /* ... and multiply both by a */
    yy = _mm_load_pd(y+i);   /* load two elements from y ... */
    yy = _mm_add_pd(yy, xx); /* ... and add */
    _mm_store_pd(y+i, yy)    /* write back both elements of y */
  }
  if (i < n) {
    y[i] += a * x[i];        /* handle last element if n is odd */
  }
}

This code required that the vectors x and y are 16-byte aligned (otherwise a segmentation fault will occur). This assumption holds, for example, for memory blocks allocated by malloc on 64bit Linux and MacOS X systems. Also, obviously, this only works on CPUs which support the SSE2 instruction set. A description of the SSE2 instructions used here can be found in the Intrinsics Reference of the Intel C++ Compiler Documentation (also seems to apply to the GNU C compiler; direct link here).

For comparison, I also tried to use the daxpy function provided by BLAS:

static void
daxpy2(int n, double a, const double *x, double *y)
{
  extern void daxpy_(int *Np, double *DAp,
                     const double *X, int *INCXp,
                     double *Y, int *INCYp);
  int inc = 1;
  daxpy_(&n, &a, x, &inc, y, &inc);
}

Details about this approach are described on my page about linear algebra packages on Linux.

Results. I tried the three functions using two vectors of length 2000. The following table gives the execution time for a million calls to daxpy (on a newish quad-core Linux machine):

method	function	time [s]
direct	`daxpy0`	1.52
SSE2	`daxpy1`	0.91
BLAS	`daxpy2`	2.47

As can be seen, the SSE2 code in daxpy1 is fastest, compared to the naive implementation daxpy0 it takes 40% off the execution time! For some reason, the BLAS version seems to be very slow; and the results on my MacOS machine are similar. Currently I have no explanation for this effect.