For the past few weeks I have been experimenting with Streaming SIMD Extensions (SSE) instructions.
It takes a while to write at this lower level, but it's fun to squeeze optimizations from your code.
Most processors since about 1999 contain eight 128-bit registers that can be used to do multiple computations in one instruction. 128 bits hold four 32-bit floating-point numbers, so there is a theoretical 4x improvement in speed, as long as the same operation is applied to each element. For example, to sum two arrays, instead of:
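Something like this plain loop (the names here are just for illustration):

```c
/* Scalar version: one float addition per loop iteration. */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```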
one can write:
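A rough SSE equivalent (assuming n is a multiple of 4 and the arrays are 16-byte aligned):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

void add_arrays_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);          /* load 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb)); /* 4 additions in one addps */
    }
}
```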
The _mm_add_ps intrinsic is just one instruction on the CPU (addps). Similar code can be written with 32-bit integers and a few other data types. Here is an example of optimizing with SSE instructions.
Writing code in C is fun because I can keep optimizing and optimizing. Here is my initial code to draw a chaotic map. The array "arr" contains the results and will be displayed after this loop is run. This is a simplified version; the real code has a few more details like settling time.
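Here's a sketch of the shape of that loop. The actual map equations aren't the point, so a Henon map stands in for them, and SIZE, the window (X0..X1, Y0..Y1), and the iteration count are illustrative:

```c
#define SIZE  1024                       /* output image is SIZE x SIZE */
#define ITERS 100000000

static int arr[SIZE * SIZE];             /* hit counts, displayed after the loop */

/* Plot window; the real values depend on the map being drawn. */
static const float X0 = -1.5f, X1 = 1.5f;
static const float Y0 = -0.5f, Y1 = 0.5f;

void draw_map(void)
{
    /* Stand-in map (Henon): x' = 1 - a*x*x + y,  y' = b*x */
    const float a = 1.4f, b = 0.3f;
    float x = 0.1f, y = 0.1f;

    for (int i = 0; i < ITERS; i++)
    {
        float xn = 1.0f - a * x * x + y;
        float yn = b * x;
        x = xn;
        y = yn;

        /* Map the point into pixel coordinates and count the hit. */
        int px = (int)(SIZE * ((x - X0) / (X1 - X0)));
        int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)));
        if (px >= 0 && px < SIZE && py >= 0 && py < SIZE)
            arr[px + py * SIZE]++;
    }
}
```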
I was already using threading to keep both of my cores busy. Let's say, for fun, that we want this to run even faster. I think CUDA would be a good way to optimize, but it requires the right hardware.
I was a bit concerned about the int casts becoming slow _ftol calls, but looking at the generated assembly showed that this wasn't the case. (Read here and here for why casts can be slow, although with modern compilers and instruction sets this doesn't seem to be an issue.)
I rewrote the loop using SSE to work with 4 values at a time, using SSE's way of doing conditionals: if a comparison is true, the corresponding field is filled with 0xffffffff, and if false, with 0x00000000.
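Here's a sketch of that version, reusing the constants above and running four independent orbits, one per SSE lane (the Henon step again stands in for the real map; m128i_i32 is MSVC's per-lane accessor, as in the snippets quoted below):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

void draw_map_sse(void)
{
    const __m128 one   = _mm_set1_ps(1.0f);
    const __m128 a     = _mm_set1_ps(1.4f);
    const __m128 b     = _mm_set1_ps(0.3f);
    const __m128 zero  = _mm_setzero_ps();
    const __m128 sizeF = _mm_set1_ps((float)SIZE);
    const __m128 vX0   = _mm_set1_ps(X0);
    const __m128 vY0   = _mm_set1_ps(Y0);
    const __m128 invXR = _mm_set1_ps(1.0f / (X1 - X0));
    const __m128 invYR = _mm_set1_ps(1.0f / (Y1 - Y0));

    /* Four orbits, one per lane, started at slightly different points. */
    __m128 x = _mm_setr_ps(0.1f, 0.2f, 0.3f, 0.4f);
    __m128 y = _mm_setr_ps(0.1f, 0.2f, 0.3f, 0.4f);

    for (int i = 0; i < ITERS / 4; i++)
    {
        /* x' = 1 - a*x*x + y,  y' = b*x, for all four lanes at once */
        __m128 xn = _mm_add_ps(_mm_sub_ps(one, _mm_mul_ps(a, _mm_mul_ps(x, x))), y);
        __m128 yn = _mm_mul_ps(b, x);
        x = xn;
        y = yn;

        /* pxF = SIZE * (x - X0) / (X1 - X0), and the same for pyF */
        __m128 pxF = _mm_mul_ps(sizeF, _mm_mul_ps(_mm_sub_ps(x, vX0), invXR));
        __m128 pyF = _mm_mul_ps(sizeF, _mm_mul_ps(_mm_sub_ps(y, vY0), invYR));

        /* Bounds test per lane: each compare fills a lane with 0xffffffff
           (true) or 0x00000000 (false); the ANDs combine the four tests. */
        __m128 ok = _mm_and_ps(
            _mm_and_ps(_mm_cmpge_ps(pxF, zero), _mm_cmplt_ps(pxF, sizeF)),
            _mm_and_ps(_mm_cmpge_ps(pyF, zero), _mm_cmplt_ps(pyF, sizeF)));
        __m128i inBounds = _mm_castps_si128(ok);

        __m128i px = _mm_cvttps_epi32(pxF);   /* truncating float -> int */
        __m128i py = _mm_cvttps_epi32(pyF);

        /* Still one scalar, conditional store per lane. */
        if (inBounds.m128i_i32[0] != 0) arr[px.m128i_i32[0] + py.m128i_i32[0] * SIZE]++;
        if (inBounds.m128i_i32[1] != 0) arr[px.m128i_i32[1] + py.m128i_i32[1] * SIZE]++;
        if (inBounds.m128i_i32[2] != 0) arr[px.m128i_i32[2] + py.m128i_i32[2] * SIZE]++;
        if (inBounds.m128i_i32[3] != 0) arr[px.m128i_i32[3] + py.m128i_i32[3] * SIZE]++;
    }
}
```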
This version was faster, down to 2,850ms.
A further optimization was to vectorize finding the array pixel location. Instead of "if (inBounds.m128i_i32[3] != 0)...", I could write:
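Roughly, with the same px, py, and inBounds as above (and assuming SIZE is a power of two so the row multiply can be a shift in this sketch):

```c
/* pixel = px + py*SIZE for all four lanes at once (SSE2).
   Here SIZE is 1024, a power of two, so the multiply is a shift; a general
   32-bit multiply would need SSE4.1's _mm_mullo_epi32 or a madd trick. */
__m128i pixel = _mm_add_epi32(px, _mm_slli_epi32(py, 10));   /* py * 1024 */

/* Mask out lanes that failed the bounds test: false lanes are all zeros,
   so the AND forces their index to 0. */
pixel = _mm_and_si128(inBounds, pixel);

/* Four unconditional stores; pixel (0,0) collects the out-of-bounds hits
   and gets cleared afterwards. */
arr[pixel.m128i_i32[0]]++;
arr[pixel.m128i_i32[1]]++;
arr[pixel.m128i_i32[2]]++;
arr[pixel.m128i_i32[3]]++;
```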
I've eliminated some conditionals. The trick is to mask out the point if inBounds is false with the line "pixel = _mm_and_si128(inBounds, pixel);".
False is represented by 0x00000000, and so if inBounds is false, pixel is set to 0. All out-of-bounds pixels are drawn onto (0,0), but this pixel can be cleared later.
This brought the time down to 1,721ms.
I noticed some redundancy between "int py = (int)(SIZE * ((y - Y0) / (Y1 - Y0)))" and "pixel = px + py*SIZE": in both, y's contribution is multiplied by a constant.
I could eliminate a multiply with something like:
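Something like this, in scalar form (the bit-mask is my guess at how the fold would have to work, with SIZE again a power of two):

```c
/* Compute y's whole contribution to the index in one go, eliminating the
   separate py*SIZE multiply. The mask keeps only whole rows (multiples of
   SIZE), which matches (int)(SIZE * f) * SIZE, ignoring float rounding. */
int yPart = (int)((float)(SIZE * SIZE) * ((y - Y0) / (Y1 - Y0))) & ~(SIZE - 1);
int pixel = px + yPart;
```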
However, this didn't work with SSE vectors because of overflow in _mm_cvttps_epi32.
My final step is the least elegant: loop unrolling. Repeating the loop body 4 times gave the fastest results. (It's faster because there's less loop overhead and perhaps better instruction scheduling; there's a trade-off, since the larger unrolled code can start missing the instruction cache as it grows.)
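The full map body is too long to paste four times here, so as an illustration of the shape, here's the simple sum loop from the top of the post unrolled the same way:

```c
/* 4x unrolled SSE sum: 16 floats per iteration instead of 4.
   Assumes n is a multiple of 16 and 16-byte-aligned pointers. */
void add_arrays_sse_unrolled(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 16)
    {
        _mm_store_ps(c + i,      _mm_add_ps(_mm_load_ps(a + i),      _mm_load_ps(b + i)));
        _mm_store_ps(c + i + 4,  _mm_add_ps(_mm_load_ps(a + i + 4),  _mm_load_ps(b + i + 4)));
        _mm_store_ps(c + i + 8,  _mm_add_ps(_mm_load_ps(a + i + 8),  _mm_load_ps(b + i + 8)));
        _mm_store_ps(c + i + 12, _mm_add_ps(_mm_load_ps(a + i + 12), _mm_load_ps(b + i + 12)));
    }
}
```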
This final step also made a significant difference.
The final time is 1,362ms, which is more than twice as fast.