SSE instructions

For the past few weeks I have been experimenting with Streaming SIMD Extensions (SSE) instructions. It takes a while to write code at this low level, but it's fun to squeeze optimizations out of it.

Most processors since about 1999 have eight 128-bit registers that can be used to do multiple computations with one instruction. 128 bits hold four 32-bit floating-point numbers, so there is a theoretical 4x speedup available, as long as the same operation is applied to each. For example, to sum arrays, instead of:

float a[4];
float b[4];
float c[4];
// assign values to a and b...
c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
c[3] = a[3] + b[3];

one can write:
__m128 a, b, c;
c = _mm_add_ps(a, b);
The _mm_add_ps intrinsic compiles to just one CPU instruction (addps). Similar code can be written with 32-bit integers and a few other data types.
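For completeness, here is a minimal sketch of the array sum written with explicit loads and stores (the function name and the load/store calls are my own additions, not part of the snippet above):

#include <xmmintrin.h> // SSE intrinsics

void add4(const float a[4], const float b[4], float c[4])
{
  __m128 va = _mm_loadu_ps(a);    // load four floats (unaligned load)
  __m128 vb = _mm_loadu_ps(b);
  __m128 vc = _mm_add_ps(va, vb); // four additions, one addps instruction
  _mm_storeu_ps(c, vc);           // store the four results back into c
}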

Here is a longer example of optimizing with SSE instructions. Writing code in C is fun because I can keep optimizing and optimizing. My initial code to draw a chaotic map is below. The array "arr" holds the results and is displayed after the loop runs. This is a simplified version; the real code has a few more details, like settling time.

#define nseeds 4
#define SIZE 256
float xseeds[nseeds] = {0.0f, 0.0f, 0.0f, 0.0f};
float yseeds[nseeds] = {0.000001f, 0.000002f, -0.0000011f, -0.0000011f};
float x, y, x_, y_;
float X0 = -1, X1 = 1, Y0 = -1, Y1 = 1; // bounds of the region to draw
for(int seed = 0; seed < nseeds; seed++) {
  x = xseeds[seed];
  y = yseeds[seed];
  
  for(int i = 0; i < PixelsToDrawPerSeed; i++) {
    x_ = c1*x - y*y; // c1 and c2 are constants of the map
    y_ = c2*y + x*y;
    x = x_;
    y = y_;

    // scale to pixel coordinates, and plot point if it is within bounds.
    int px = (int)(SIZE * ((x - X0) / (X1 - X0)));
    int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)));
    if (py >= 0 && py < SIZE && px >= 0 && px < SIZE) {
      arr[px + py*SIZE] = 1; // plot point at px, py.
    }
  }
}

I was already using threading to keep both of my cores busy. Let's say, for fun, that we want this to run even faster. I think CUDA would be a good way to optimize, but it requires the right hardware.

I was a bit concerned about the int casts turning into a slow _ftol call, but looking at the generated assembly showed that this wasn't the case. (Float-to-int casts have a reputation for being slow, although with modern compilers and instruction sets this doesn't seem to be an issue.)
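If a compiler did emit _ftol, one workaround is to do the truncating conversion explicitly through an intrinsic. A minimal sketch (my own addition, not part of the original code):

#include <xmmintrin.h>

// Truncating float-to-int conversion via the cvttss2si instruction,
// with no call into a runtime helper such as _ftol.
int float_to_int_trunc(float f)
{
  return _mm_cvtt_ss2si(_mm_set_ss(f));
}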

I rewrote the code using SSE to work with 4 values at a time, one seed per lane.

__m128 x128 = _mm_setr_ps( 0.0f, 0.0f, 0.0f, 0.0f);                       // the four x seeds, one per lane
__m128 y128 = _mm_setr_ps( 0.000001f, 0.000002f,-0.0000011f,-0.0000011f); // the four y seeds
__m128 mConst_a = _mm_set1_ps(c1); //sets all 4 fields to c1
__m128 mConst_b = _mm_set1_ps(c2);
__m128 nextx128, tempx, tempy;
__m128i const_size = _mm_set1_epi32(SIZE), const_zero = _mm_set1_epi32(0);

for(int i = 0; i < PixelsToDrawPerSeed; i++)
{
  nextx128 = _mm_sub_ps(_mm_mul_ps(mConst_a, x128), _mm_mul_ps(y128,y128));
  y128 = _mm_mul_ps(y128, _mm_add_ps(x128, mConst_b)); 
  x128 = nextx128;

  // scale to pixel coordinates. assume that pixels are square, that X1-X0==Y1-Y0.
  // the following computes px = (int)(SIZE * ((x - X0) / (X1 - X0)));
  __m128 mConst_scale = _mm_set1_ps(SIZE / (X1 - X0));
  tempx = _mm_sub_ps(x128, _mm_set1_ps(X0));
  tempx = _mm_mul_ps(tempx, mConst_scale); 
  tempy = _mm_sub_ps(y128, _mm_set1_ps(Y0));
  tempy = _mm_mul_ps(tempy, mConst_scale); 
  __m128i xPt = _mm_cvttps_epi32(tempx); // cast to int32
  __m128i yPt = _mm_cvttps_epi32(tempy); 
  __m128i xInBounds = _mm_and_si128(_mm_cmpgt_epi32(xPt, const_zero), _mm_cmplt_epi32(xPt, const_size));
  __m128i yInBounds = _mm_and_si128(_mm_cmpgt_epi32(yPt, const_zero), _mm_cmplt_epi32(yPt, const_size));
  __m128i inBounds = _mm_and_si128(xInBounds, yInBounds);
  
  if (inBounds.m128i_i32[3] != 0)
    arr[xPt.m128i_i32[3] + yPt.m128i_i32[3]*SIZE] = 1;
  if (inBounds.m128i_i32[2] != 0)
    arr[xPt.m128i_i32[2] + yPt.m128i_i32[2]*SIZE] = 1;
  if (inBounds.m128i_i32[1] != 0)
    arr[xPt.m128i_i32[1] + yPt.m128i_i32[1]*SIZE] = 1;
  if (inBounds.m128i_i32[0] != 0)
    arr[xPt.m128i_i32[0] + yPt.m128i_i32[0]*SIZE] = 1;
}
I'm using SSE's way of doing conditionals: if a comparison is true, the field is filled with 0xffffffff, and if false, with 0x00000000. This version was faster, down to 2,850 ms.
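As an aside, this all-ones/all-zeros mask representation is what makes branchless selects possible in SSE2. A common idiom, not used in the code above and shown here only as a sketch:

#include <emmintrin.h> // SSE2 integer intrinsics

// pick a where the mask is 0xffffffff and b where it is 0x00000000
__m128i select_epi32(__m128i mask, __m128i a, __m128i b)
{
  return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}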


A further optimization was to vectorize finding the array pixel location. Instead of "if (inBounds.m128i_i32[3] != 0)...", I could write:
// since SIZE is a power of 2, we can shift left instead of multiplying!
// e.g. with SIZE=256, LOG2SIZE=8
__m128i pixel = _mm_add_epi32(xPt, _mm_slli_epi32(yPt, LOG2SIZE));
pixel = _mm_and_si128(inBounds, pixel); // if out of bounds, draw at (0,0) instead
arr[pixel.m128i_i32[3]] = 1;
arr[pixel.m128i_i32[2]] = 1;
arr[pixel.m128i_i32[1]] = 1;
arr[pixel.m128i_i32[0]] = 1;

I've eliminated some conditionals. The trick is to mask out the point when inBounds is false, with the line "pixel = _mm_and_si128(inBounds, pixel);". False is represented by 0x00000000, so if inBounds is false, pixel becomes 0. All out-of-bounds pixels get drawn onto (0,0), but that pixel can be cleared later. The speed improved to 1,721 ms.
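Clearing that sacrificial pixel after the loop is a one-liner (a sketch; nothing in bounds is lost, since the comparisons above already exclude index 0):

arr[0] = 0; // discard everything that was masked to (0,0)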

I notice some redundancy in "int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)))" and "pixel = px + py*SIZE". In both, y is multiplied by a constant. I could eliminate a multiply with something like:
int py_times_size = (int)(SIZE * SIZE * ((y - Y0) / (Y1 - Y0)));
if (py_times_size > 0 && py_times_size < SIZE * SIZE && px > 0 && ...)
  pixel = px + py_times_size;

However, this didn't work with SSE vectors because of overflow: with the extra factor of SIZE, the scaled values can exceed the range that _mm_cvttps_epi32 can convert.

My final step is the least elegant: loop unrolling. Repeating the loop body four times gave the fastest results. (Unrolling is faster because there is less loop overhead and perhaps better instruction scheduling; there is a trade-off, since instruction-cache misses increase as the unrolled body grows.) This final step also made a significant difference.
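As a sketch of the structure (ITERATE_AND_PLOT is a hypothetical macro wrapping the loop body shown earlier, and PixelsToDrawPerSeed is assumed to be a multiple of 4):

for(int i = 0; i < PixelsToDrawPerSeed; i += 4)
{
  ITERATE_AND_PLOT(); // advance the four seeds one step and plot, as in the loop body above
  ITERATE_AND_PLOT();
  ITERATE_AND_PLOT();
  ITERATE_AND_PLOT();
}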

The final time is 1,362 ms, more than twice as fast as the first SSE version's 2,850 ms.