Yesterday I learned how to speed up the vector operation y := y + αx by using SSE2 instructions.
The function I was trying to speed up was as follows:
static void
daxpy0(int n, double a, const double *x, double *y)
{
  int i;
  for (i=0; i<n; ++i) {
    y[i] += a * x[i];
  }
}
Using SSE2 instructions, operating on two double precision numbers at a time, this can be written as follows:
#include <emmintrin.h>
static void
daxpy1(int n, double a, const double *x, double *y)
{
  __m128d aa, xx, yy;
  int i;
  aa = _mm_set1_pd(a);         /* store a in both halves of the register */
  for (i=0; i+1<n; i+=2) {
    xx = _mm_load_pd(x+i);     /* load two elements from x ... */
    xx = _mm_mul_pd(aa, xx);   /* ... and multiply both by a */
    yy = _mm_load_pd(y+i);     /* load two elements from y ... */
    yy = _mm_add_pd(yy, xx);   /* ... and add */
    _mm_store_pd(y+i, yy);     /* write back both elements of y */
  }
  if (i < n) {
    y[i] += a * x[i];          /* handle last element if n is odd */
  }
}
This code requires that the vectors x and y
are 16-byte aligned (otherwise a segmentation fault will occur). This
assumption holds, for example, for memory blocks allocated by
malloc on 64-bit Linux and Mac OS X systems. Also,
obviously, this only works on CPUs which support the SSE2 instruction set.
A description of the SSE2 instructions used here can be found in the
Intrinsics Reference of the Intel C++ Compiler Documentation
(which also seems to apply to the GNU C compiler).
For comparison, I also tried to use the daxpy function
provided by BLAS:
static void
daxpy2(int n, double a, const double *x, double *y)
{
  extern void daxpy_(int *Np, double *DAp,
                     const double *X, int *INCXp,
                     double *Y, int *INCYp);
  int inc = 1;
  daxpy_(&n, &a, x, &inc, y, &inc);
}
Details about this approach are described on my page about linear algebra packages on Linux.
Results. I tried the three functions using two
vectors of length 2000. The following table gives the execution time for a
million calls to daxpy (on a newish quad-core Linux machine):
| method | function | time [s] |
|---|---|---|
| direct | daxpy0 | 1.52 |
| SSE2 | daxpy1 | 0.91 |
| BLAS | daxpy2 | 2.47 |
As can be seen, the SSE2 code in daxpy1 is fastest:
compared to the naive implementation daxpy0 it takes 40% off
the execution time! For some reason, the BLAS version seems to be very
slow, and the results on my MacOS machine are similar. Currently I have no
explanation for this effect.
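As a quick sanity check of the figures in the table, the following Python sketch recomputes the relative saving and a rough floating-point rate from the measured times. It assumes 2 floating-point operations (one multiplication, one addition) per vector element; the helper names are made up for illustration.

```python
# Timings from the table above: seconds for one million calls
# on vectors of length 2000.
TIMES = {"daxpy0": 1.52, "daxpy1": 0.91, "daxpy2": 2.47}
N = 2000          # vector length
CALLS = 1000000   # number of calls timed

def relative_saving(baseline, t):
    """Fraction of the baseline run time that is saved (negative if slower)."""
    return 1.0 - t / baseline

def gflops(t, n=N, calls=CALLS):
    """Rough rate in GFLOP/s, assuming 2 flops per element (an assumption)."""
    return 2.0 * n * calls / t / 1e9

if __name__ == "__main__":
    for name in ("daxpy0", "daxpy1", "daxpy2"):
        t = TIMES[name]
        print("%s: %+.0f%% time saved vs daxpy0, about %.1f GFLOP/s"
              % (name, 100 * relative_saving(TIMES["daxpy0"], t), gflops(t)))
```

For daxpy1 this reproduces the saving of about 40% quoted above; for daxpy2 the saving is negative, since the BLAS call takes roughly 1.6 times as long as the plain loop.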
Recently I learned how to tunnel http traffic (e.g. web surfing) over an ssh connection. The effect of this is that you can browse the web on one computer A, say, but for the web servers you are visiting it will look like your requests originate from a different host B. You need to be able to log into host B via ssh for this to work.
There are several situations where such tunneling is useful:
Setting up a tunnel is done in two steps:
ssh -D 8080 -f -q -N login@host
You will need to replace login and host with your login details. The machine you type this command on is machine A in the description above, and host is machine B. This command will start an ssh process which will run in the background and will act as a tunnel to forward the web traffic.
Configure your web browser to use a SOCKS proxy, with host localhost and port 8080. Both SOCKS4 and SOCKS5 should work. This will tell the web browser to connect to the local end of the ssh tunnel.
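If the browser cannot connect, it can help to check first whether anything is listening on the local end of the tunnel. The following is a minimal sketch using only the Python standard library; the function name is made up for illustration, and the check only shows that some process accepts TCP connections on localhost port 8080, not that it is the ssh tunnel or that the remote side works.

```python
import socket

def tunnel_is_listening(host="localhost", port=8080, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # Open and immediately close a plain TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False

if __name__ == "__main__":
    print("local end of the tunnel reachable:", tunnel_is_listening())
```

The default port matches the 8080 given to the -D option in the ssh command above; if you use a different port there, pass the same value here.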
Over the weekend I finished a helper which makes it easy to distribute a number of programs over the available CPUs on a multi-core system. You can find the program on the Parallel homepage.
I just found somebody's blog post stating 8 Reasons Normal People Should Juggle and I agree very much with his reasons.
Yesterday, I completed version 0.6 of my Python parser generator Wisent. The new version allows UTF-8 encoded grammar files to be used (originally, only ASCII characters could be used in grammar files), adds a CSS parser as an additional example, and fixes some minor bugs. You can download the program from the Wisent homepage.
Yesterday I finished a revised version of my article about exponential Tauberian theorems. The main change is that my proof, as it turned out, could be trivially modified to give a more general statement. Many thanks to the (anonymous) referee for pointing this out.
Today, one of our papers was accepted:
Just in case this is useful for anybody: here is an implementation of the HSV to RGB colorspace conversion in Python:
from __future__ import division

def hsv2rgb(h, s, v):
    # h is the hue in degrees, s and v are in the range [0, 1]
    hi = int(h//60 % 6)
    f = h/60 - h//60
    p = v * (1-s)
    q = v * (1-f*s)
    t = v * (1-(1-f)*s)
    return [ (v,t,p), (q,v,p), (p,v,t), (p,q,v), (t,p,v), (v,p,q) ][hi]
Update. In the meantime I learned that the Python standard library has a built-in version of this function in the colorsys module.
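One difference to keep in mind: the function above takes the hue in degrees (it divides h by 60), whereas colorsys.hsv_to_rgb expects all three arguments in the range [0, 1]. A small sketch of a wrapper which bridges the two conventions might look as follows; the wrapper name is made up for illustration.

```python
import colorsys

def hsv2rgb_colorsys(h_degrees, s, v):
    """Hypothetical wrapper: hue in degrees, s and v in [0, 1]."""
    # colorsys expects the hue as a fraction of a full turn, not in degrees
    return colorsys.hsv_to_rgb((h_degrees % 360) / 360.0, s, v)

if __name__ == "__main__":
    print(hsv2rgb_colorsys(0, 1, 1))     # pure red
    print(hsv2rgb_colorsys(180, 1, 1))   # cyan
```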
Today, I finally submitted the SPDE-paper I wrote with Martin Hairer and Andrew Stuart. This paper took a long time to complete (and it changed from an applied maths paper into a pure maths paper in the process). You can download a pre-print here; comments are very welcome:
A while ago, Andreas and I submitted another paper. Here is a pre-print:
A. Voss, J. Voss and K.C. Klauer:
Separating Response-Execution Bias from Decision Bias: Arguments for an Additional Parameter in Ratcliff's Diffusion Model.
To appear in the British Journal of Mathematical and Statistical Psychology,
2009.
preprint, more…
Copyright © 2009, Jochen Voss. All content on this website (including text, pictures, and any other original works), unless otherwise noted, is licensed under a Creative Commons Attribution-Share Alike 3.0 License.