By Jochen Voss, on
Yesterday I learned how to speed up the vector operation y := y + αx by using SSE2 instructions.
The function I was trying to speed up was as follows:
static void daxpy0(int n, double a, const double *x, double *y) { int i; for (i=0; i<n; ++i) { y[i] += a * x[i]; } }
Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:
#include <emmintrin.h> static void daxpy1(int n, double a, const double *x, double *y) { __m128d aa, xx, yy; int i; aa = _mm_set1_pd(a); /* store a in both registers */ for (i=0; i+1<n; i+=2) { xx = _mm_load_pd(x+i); /* load two elements from x ... */ xx = _mm_mul_pd(aa, xx); /* ... and multiply both by a */ yy = _mm_load_pd(y+i); /* load two elements from y ... */ yy = _mm_add_pd(yy, xx); /* ... and add */ _mm_store_pd(y+i, yy) /* write back both elements of y */ } if (i < n) { y[i] += a * x[i]; /* handle last element if n is odd */ } }
This code required that the vectors x
and y
are 16-byte aligned (otherwise a segmentation fault will occur). This
assumption holds, for example, for memory blocks allocated by
malloc
on 64bit Linux and MacOS X systems. Also,
obviously, this only works on CPUs which support the SSE2 instruction set.
A description of the SSE2 instructions used here can be found in the
Intrinsics Reference
of the
Intel C++ Compiler Documentation
(also seems to apply to the GNU C compiler; direct link
here).
For comparison, I also tried to use the daxpy
function
provided by BLAS:
static void daxpy2(int n, double a, const double *x, double *y) { extern void daxpy_(int *Np, double *DAp, const double *X, int *INCXp, double *Y, int *INCYp); int inc = 1; daxpy_(&n, &a, x, &inc, y, &inc); }
Details about this approach are described on my page about linear algebra packages on Linux.
Results. I tried the three functions using two
vectors of length 2000. The following table gives the execution time for a
million calls to daxpy
(on a newish quad-core Linux machine):
method | function | time [s] | direct | daxpy0
| 1.52 |
---|---|---|
SSE2 | daxpy1
| 0.91 |
BLAS | daxpy2
| 2.47 |
As can be seen, the SSE2 code in daxpy1
is fastest,
compared to the naive implementation daxpy0
it takes 40% off
the execution time! For some reason, the BLAS version seems to be very
slow; and the results on my MacOS machine are similar. Currently I have no
explanation for this effect.
This is an excerpt from Jochen's blog.
Newer entry: scrabble
Older entry: tunnelling http over ssh
Copyright © 2010 Jochen Voss. All content on this website (including text, pictures, and any other original works), unless otherwise noted, is licensed under the CC BY-SA 4.0 license.