
[return to "From slow to SIMD: A Go optimization story"]
1. miki12+ND 2024-01-23 20:39:57
>>rbanff+(OP)
I wonder how well a simple C for loop with -O3 and maybe -march=native would do here.

From my brief forays into reading (mostly AArch64) assembly, it looks like C compilers can now detect these kinds of patterns and convert them to SIMD by themselves, with no work from the programmer. Even at -O2, converting an index-based loop into one based on start and end pointers is not unusual. Go doesn't seem to do this; the assembly output by the Go compiler looks much closer to the actual code than what you get from C.
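
For example, an integer dot product like the sketch below (my own toy example, not code from the article) gets vectorized automatically at -O3; easy to confirm on godbolt.org:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example. Compile with: cc -O3 -march=native -c dot.c
       Both GCC and Clang turn this loop into SIMD on their own, because
       integer addition is associative and can be freely reordered. */
    int32_t dot_i32(const int32_t *a, const int32_t *b, size_t n) {
        int32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }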

Rust iterators would also be fun to benchmark here: they're supposed to be as fast as plain old loops, and they're probably optimized to omit bounds checks entirely.

2. mratsi+dH 2024-01-23 20:53:58
>>miki12+ND
It depends.

You need two to three accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't create them on its own: it may only do so when the operation is associative, i.e. (a+b)+c = a+(b+c), which holds for integers but not for floats.

There is an escape hatch in -ffast-math.
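
Concretely (my own sketch, not the benchmark code from the repo below): a plain float reduction like this stays scalar at -O3, and only gets vectorized once -ffast-math lets the compiler reassociate the additions:

    #include <stddef.h>

    /* Hypothetical sketch. Scalar at -O3; vectorized with -O3 -ffast-math
       (the narrower -fassociative-math flag family on GCC/Clang also works). */
    float sum_f32(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += x[i];  /* one long dependency chain of float adds */
        return s;
    }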

I have extensive benchmarks on this here: https://github.com/mratsim/laser/blob/master/benchmarks/fp...

3. lifthr+1A1 2024-01-24 02:53:29
>>mratsi+dH
You don't necessarily need `-ffast-math` if you're willing to tweak your code a bit: you can introduce the accumulators yourself, as in the sketch below. Such optimizer-friendly code is not hard to write once you know the basic principles.
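
Something along these lines (my sketch of the idea; four accumulators picked arbitrarily):

    #include <stddef.h>

    /* Hypothetical sketch: manual accumulators give the CPU four independent
       dependency chains, no -ffast-math required. */
    float sum_f32_4acc(const float *x, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += x[i + 0];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)  /* remainder */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }

Note the result is rounded differently from the strict left-to-right sum, which is exactly the reassociation the compiler isn't allowed to perform on its own.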