From slow to SIMD: A Go optimization story

>>rbanff+(OP)
I wonder how well a simple C for loop with -O3 and maybe -march would do here.

From my brief forays into reading (mostly AARCH64) assembly, it looks like C compilers can detect these kinds of patterns now and just convert them all to SIMD by themselves, with no work from the programmer. Even at -O2, converting an index-based loop into one based on start and end pointers is not unusual. Go doesn't seem to do this; the assembly output by the Go compiler looks much closer to the actual code than what you get from C.

Rust iterators would also be fun to benchmark here, they're supposed to be as fast as plain old loops, and they're probably optimized to omit bounds checks entirely.

>>miki12+ND
> Rust iterators would also be fun to benchmark here

I started to write this out, and then thought "you know what given how common this is, I bet I could even just google it" and thought that would be more interesting, as it makes it feel more "real world." The first result I got is what I would have written: https://stackoverflow.com/a/30422958/24817
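For anyone who doesn't want to click through, a zip-based dot product in Rust usually looks something like the sketch below. This is an illustration rather than a copy of the linked answer; the `f32` element type and the name `dot_zip` are assumptions.

```rust
// Dot product over two slices using iterator adapters.
// `zip` stops at the shorter input, so there is no explicit length check
// and no indexing that would need a bounds check.
fn dot_zip(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    println!("dot = {}", dot_zip(&a, &b));
}
```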

Here's a godbolt with three different outputs: one at -O, one at -O3, and one at -O3 and -march=native

https://godbolt.org/z/6xf9M1cf3

Some comments from eyeballing it:

Looks like 2 and 3 both provide extremely similar if not identical output.

Adding the native flag ends up generating slightly different codegen; I am not at the level to be able to simply look at that and know how meaningful the difference is.

It does appear to have eliminated the bounds check entirely, and it's using xmm registers.

I am pleasantly surprised at this output because zip in particular can sometimes hinder optimizations, but rustc did a great job here.

----------------------

For fun, I figured "why not also try as direct a translation of the original Go as possible." The only trick here is that Rust doesn't really do the C-style for loop the way Go does, so I tried to translate what I saw as the spirit of the example: compare the two lengths and use the minimum for the loop length.

Here it is: https://godbolt.org/z/cTcddc8Gs
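For reference, a loop in that spirit can be sketched as follows. The name `dot_indexed` and the `f32` element type are assumptions for illustration; the godbolt link above has the code that was actually compiled.

```rust
// Index-based dot product: take the shorter of the two lengths
// and loop over that range, mirroring a C-style for loop.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut sum = 0.0_f32;
    for i in 0..n {
        // i < n <= both lengths, so these indexing operations cannot panic.
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0];
    println!("dot = {}", dot_indexed(&a, &b));
}
```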

... literally the same. I am very surprised at this outcome. It makes me wonder if LLVM has some sort of idiom recognition for dot product specifically.

EDIT: looks like it does not currently, see the comment at line 28 and 29: https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html

>>stevek+mO
Those aren't vectorized at all, just unrolled. vmulss/vaddss are scalar instructions: they multiply/add only a single single-precision float in the low lane of the vector register.

With clang you get basically the same codegen, although it uses fused multiply adds.

The problem is that you need to enable -ffast-math; otherwise the compiler can't change the order of floating-point operations, and thus can't vectorize the reduction.
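To see why reordering is off limits by default: floating-point addition is not associative, so regrouping a sum can change the result. A minimal Rust sketch with values picked to make the effect obvious (the function names are hypothetical):

```rust
// Same three operands, two different groupings.
fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}

fn main() {
    // At a magnitude of 1e8 adjacent f32 values are 8 apart,
    // so adding 1.0 to -1e8 rounds straight back to -1e8.
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    println!("(a + b) + c = {}", sum_left(a, b, c));
    println!("a + (b + c) = {}", sum_right(a, b, c));
}
```

A vectorized reduction keeps several partial sums and combines them at the end, which is exactly this kind of regrouping, so the compiler may only do it when told the difference doesn't matter.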

With clang that works wonderfully and it gives us a lovely four times unrolled AVX2 fused multiply add loop, but enabling it in rust doesn't seem to work: https://godbolt.org/z/G4Enf59Kb

Edit: from what I can tell this is still an open issue??? https://github.com/rust-lang/rust/issues/21690

Edit: relevant SO post: https://stackoverflow.com/questions/76055058/why-cant-the-ru... Apparently you need to use `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast` and `std::intrinsics::fmul_fast`.
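On stable Rust, another option is to write the reassociation out by hand with several independent accumulators, so the source itself spells out the order a vectorizer would want. This is a sketch under assumptions: the name `dot_lanes` and the lane count of 4 are arbitrary, the result can differ in the last bits from a strict left-to-right sum, and whether LLVM actually emits packed instructions for it depends on compiler version and target flags (not verified here).

```rust
// Dot product with four independent partial sums.
// Each lane only ever adds to itself, so the lanes can be computed in parallel
// without the compiler having to reorder any floating-point operation.
fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let mut acc = [0.0_f32; 4];
    let mut chunks_a = a.chunks_exact(4);
    let mut chunks_b = b.chunks_exact(4);
    for (x, y) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }

    // Combine the lanes, then handle the 0 to 3 leftover elements.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        sum += x * y;
    }
    sum
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0_f32; 6];
    println!("dot = {}", dot_lanes(&a, &b));
}
```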

>>camel-+mR
It was just last week I was reading a comment that made it seem like you shouldn't really use -ffast-math[0], but here it looks like a non-rare reason why you would want to enable it.

What is the correct idiom here? It feels like if this sort of thing really matters to you, you should have the know-how to hand-roll a couple of lines of ASM. I want to say this is rare, but I had a project a couple of years ago where I needed to hand-roll some vectorized instructions on a Raspberry Pi.

[0] >>39013277

>>nemoth+4W
Imo the best solution would be special "fast" floating-point types, that have less strict requirements.

I personally almost always use -ffast-math by default in my C programs that care about performance, because I almost never care enough about the loss in accuracy. The only case I remember needing that accuracy was when doing some random number distribution tests where I cared about subnormals, and got confused for a second because they didn't seem to exist (-ffast-math disables them on x86).
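For context, a subnormal is a nonzero value smaller in magnitude than the smallest normal float. A small Rust sketch (hypothetical function name) that produces one under default IEEE 754 behavior; with flush-to-zero enabled, the same division would be expected to give 0 instead:

```rust
// Halving the smallest positive normal f32 lands in the subnormal range
// when the floating-point environment follows default IEEE 754 rules.
fn half(x: f32) -> f32 {
    x / 2.0
}

fn main() {
    let tiny = half(f32::MIN_POSITIVE);
    println!("value     = {:e}", tiny);
    println!("nonzero   = {}", tiny > 0.0);
    println!("subnormal = {}", tiny.is_subnormal());
}
```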

>>camel-+5Z
That or a scoped optimization directive. GCC does allow `__attribute__((optimize("-ffast-math")))` as a function-wide attribute, but Clang doesn't seem to have an equivalent and the standard syntax `[[gcc::optimize("-ffast-math")]]` doesn't seem to work either. In any case, such optimization should be visible from the code in my opinion.

>>lifthr+Ox1
The problem is that it takes a single piece of code compiled with -ffast-math to break everything; it's simply not worth it.

>>lorenz+4T1
GP seems to be saying that you can flag individual functions in GCC, thereby avoiding this issue: only flagged functions would be compiled with fast math semantics.

>>maskli+eV1
This only works if it's a leaf function that will throw away the result. If you feed the result of your -ffast-math function into other working code, you risk breaking it.

>>lorenz+lW1
`-ffast-math` is fully local, aside from GCC's unexpected `crtfastmath.o` linkage, which is global.

Functions with `-ffast-math` enabled still return fp values via the usual registers and in the usual formats. If some function `f` is expected to return -1.0 to 1.0 for particular inputs, `-ffast-math` can only make it return 1.001 or NaN instead. If another function without `-ffast-math` relies on f's return value without verifying it, it will surely misbehave, but only because the original analysis of f no longer holds.
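A Rust sketch of that failure mode, with hypothetical names: `angle_unchecked` trusts that its input is a cosine in [-1.0, 1.0], so a slightly out-of-range value such as 1.001 silently turns into NaN, while a caller that clamps first is unaffected.

```rust
// Trusts the caller's range analysis: the input must already be in [-1.0, 1.0].
fn angle_unchecked(cos_theta: f32) -> f32 {
    cos_theta.acos()
}

// Verifies the assumption instead of trusting it.
fn angle_clamped(cos_theta: f32) -> f32 {
    cos_theta.clamp(-1.0, 1.0).acos()
}

fn main() {
    // Stand-in for a result that drifted just outside the expected range.
    let drifted = 1.001_f32;
    println!("unchecked: {}", angle_unchecked(drifted));
    println!("clamped:   {}", angle_clamped(drifted));
}
```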

`-ffast-math` as a compiler option is bad because this effect is not evident from the code. Anything visible in the code should be okay.
