zlacker

I wonder how well a simple C for loop with -O3 and maybe -march would do here.

From my brief forays into reading (mostly AARCH64) assembly, it looks like C compilers can detect these kinds of patterns now and just convert them all to SIMD by themselves, with no work from the programmer. Even at -O2, converting an index-based loop into one based on start and end pointers is not unusual. Go doesn't seem to do this; the assembly output by the Go compiler looks much closer to the actual code than what you get from C.

Rust iterators would also be fun to benchmark here; they're supposed to be as fast as plain old loops, and they're probably optimized to omit bounds checks entirely.

replies(2): >>mratsi+q3 >>stevek+za

>>miki12+(OP)
It depends.

You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.
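
A minimal sketch of why the compiler is not allowed to regroup float additions on its own. The function names `sum_left` and `sum_right` are made up for illustration; the same three `f32` inputs give different answers depending on the grouping:

```rust
/// Sum grouped to the left: (a + b) + c.
fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Sum grouped to the right: a + (b + c).
fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With `a = 1e20`, `b = -1e20`, `c = 1.0`, the left grouping cancels the large terms first and keeps the `1.0`, while the right grouping absorbs the `1.0` into `-1e20` before cancelling, so the `1.0` is lost.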

There is an escape hatch in -ffast-math.

I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...

replies(2): >>menaer+St >>lifthr+eW

>>miki12+(OP)
> Rust iterators would also be fun to benchmark here

I started to write this out, and then thought "you know what given how common this is, I bet I could even just google it" and thought that would be more interesting, as it makes it feel more "real world." The first result I got is what I would have written: https://stackoverflow.com/a/30422958/24817

Here's a godbolt with three different outputs: one at -O, one at -O3, and one at -O3 and -march=native.

https://godbolt.org/z/6xf9M1cf3

Comments from eyeballing it:

Looks like 2 and 3 both provide extremely similar if not identical output.

Adding the native flag ends up generating slightly different code, but I am not at the level to be able to simply look at that and know how meaningful the difference is.

It does appear to have eliminated the bounds check entirely, and it's using xmm registers.

I am pleasantly surprised at this output because zip in particular can sometimes hinder optimizations, but rustc did a great job here.

----------------------

For fun, I figured "why not also try as direct a translation of the original Go as possible." The only trick here is that Rust doesn't really do the C-style for loop the way Go does, so I tried to translate what I saw as the spirit of the example: compare the two lengths and use the minimum for the loop length.

Here it is: https://godbolt.org/z/cTcddc8Gs
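
For readers who don't want to open the links, the two shapes being compared look roughly like this. This is a sketch, not the exact code behind the godbolt links, and `dot_zip` and `dot_indexed` are hypothetical names:

```rust
/// Iterator version: `zip` stops at the end of the shorter slice.
fn dot_zip(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Index version: compute the minimum length up front, then loop over indices.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let mut sum = 0.0f32;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}
```

Both accumulate in the same left-to-right order, so they should produce the same value for the same inputs.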

... literally the same. I am very surprised at this outcome. It makes me wonder if LLVM has some sort of idiom recognition for dot product specifically.

EDIT: looks like it does not currently, see the comment at line 28 and 29: https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html

replies(2): >>camel-+zd >>Measte+ss2

>>stevek+za
Those aren't vectorized at all, just unrolled. vmulss/vaddss just multiply/add the single-precision floating-point in the vector register.

With clang you get basically the same codegen, although it uses fused multiply adds.

The problem is that you need to enable -ffast-math, otherwise the compiler can't change the order of floating point operations, and thus not vectorize.

With clang that works wonderfully and it gives us a lovely four times unrolled AVX2 fused multiply add loop, but enabling it in rust doesn't seem to work: https://godbolt.org/z/G4Enf59Kb

Edit: from what I can tell this is still an open issue??? https://github.com/rust-lang/rust/issues/21690

Edit: relevant SO post: https://stackoverflow.com/questions/76055058/why-cant-the-ru... Apparently you need to use `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast` and `std::intrinsics::fmul_fast`.

replies(3): >>stevek+ch >>nemoth+hi >>srouss+Cp

>>camel-+zd
Ahh yes, thank you.

Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to llvm. I am kinda glad that escape hatch doesn't work, to be honest.

There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv

The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690

It would be nice to get these intrinsics stabilized, at least.

EDIT: oops you figured this out while I was writing it, haha.

replies(1): >>cbm-vi+Cn

>>camel-+zd
It was just last week I was reading a comment that made it seem like you shouldn't really use -ffast-math[0], but here it looks like there is a non-rare reason why you would want to enable it.

What is the correct idiom here? It feels like if this sort of thing really matters to you, you should have the know-how to hand-roll a couple of lines of ASM. I want to say this is rare, but I had a project a couple of years ago where I needed to hand-roll some vectorized instructions on a Raspberry Pi.

[0] >>39013277

replies(7): >>stevek+Qj >>camel-+il >>sevagh+tn >>the847+Os >>jkafja+fw >>miki12+gw >>orlp+zz

>>nemoth+hi
The right path forward for Rust here in my opinion is to do the same thing as is done for math operations like saturating: stabilize a function or method that performs the operation with this semantic, and then build thin wrappers on top to make using them more convenient.
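
As a sketch of the existing integer pattern being referred to: the explicit methods come first, and the wrapper type just makes them convenient to use with operators. The helper names `explicit_methods` and `wrapper_type` are hypothetical; `saturating_add`, `wrapping_add` and `std::num::Wrapping` are the standard library pieces. Nothing equivalent for fast float math is shown here.

```rust
use std::num::Wrapping;

/// The semantics are selected per call through a named method.
fn explicit_methods(x: u8, y: u8) -> (u8, u8) {
    (x.saturating_add(y), x.wrapping_add(y))
}

/// A thin wrapper type lets `+` carry the wrapping semantics instead.
fn wrapper_type(x: u8, y: u8) -> u8 {
    (Wrapping(x) + Wrapping(y)).0
}
```

The appeal of this shape is that the relaxed semantics are visible at the call site or in the type, rather than hidden in a compiler flag.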

>>nemoth+hi
Imo the best solution would be special "fast" floating-point types, that have less strict requirements.

I personally almost always use -ffast-math by default in my C programs that care about performance, because I almost never care enough about the loss in accuracy. The only case I remember needing it was when doing some random number distribution tests where I cared about subnormals, and got confused for a second because they didn't seem to exist (-ffast-math disables them on x86).

replies(1): >>lifthr+1U

>>nemoth+hi
Definitely don't take an HN comment as a serious suggestion. Enable fast-math for your code, run your objective evaluation that's suitable for your domain, and if it passes the test, enjoy the added speed.

FWIW I have oodles of numerical C++ code where fast-math doesn't change the output.

replies(2): >>lifthr+GS >>Dylan1+G71

>>stevek+ch
Even without a -ffast-math flag, the current stable Rust compiler will vectorize loops on integer types.

https://godbolt.org/z/KjErzacfv
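
A sketch of the kind of integer loop meant here (not the code behind the link; `dot_i32` is a hypothetical name). Wrapping integer addition is associative, so the compiler is free to regroup the sum, which is the property the float version lacks:

```rust
/// Integer dot product with explicit wrapping arithmetic.
fn dot_i32(a: &[i32], b: &[i32]) -> i32 {
    a.iter()
        .zip(b)
        .fold(0i32, |acc, (x, y)| acc.wrapping_add(x.wrapping_mul(*y)))
}
```

Whether a given compiler version actually emits SIMD for this is something to check in the generated assembly rather than assume.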

Edit: ...and I now realize who I responded to, I'm sure you already know this. :)

replies(1): >>galang+Iy

>>camel-+zd
fast-math changes even libraries not compiled with it, so buyer beware.

>>nemoth+hi
Usually one wants a subset of the thing that -ffast-math does, e.g. -fassociative-math. And only within some limited scope.

replies(1): >>camel-+iu

>>mratsi+q3
In my experience, compilers rarely know how to make use of ILP even in what you would expect to be the "simple" cases. Handwriting the SIMD, at least in my case, almost always proved to be several times faster than the auto-vectorized code generated by the compiler.

replies(1): >>mratsi+iG

>>the847+Os
I played around with the example a bit; the minimum for vectorization seems to be -fassociative-math -fno-signed-zeros. The gcc docs say -fassociative-math requires -fno-signed-zeros and -fno-trapping-math though.

I suppose -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math will get you most of the way there, and maybe an -ffinite-math-only when appropriate.

>>nemoth+hi
The usual technique is to keep a 4-element array of sums (so sum[j] is the sum of all terms of the form a[4*i + j] * b[4*i + j]), and then take the total at the very end. This allows for the use of vectorization even with strict IEEE-compliance.
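
A minimal sketch of that technique in Rust, assuming `f32` slices; the name `dot_four_accumulators` and the way the tail is handled are illustrative choices, not taken from any code linked in the thread:

```rust
/// Dot product with four independent partial sums.
/// sums[j] accumulates a[4*i + j] * b[4*i + j]; the partial sums
/// are only combined once, at the very end.
fn dot_four_accumulators(a: &[f32], b: &[f32]) -> f32 {
    // Use the common prefix so both slices have the same length.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let mut sums = [0.0f32; 4];
    let a_chunks = a.chunks_exact(4);
    let b_chunks = b.chunks_exact(4);
    // Leftover elements that do not fill a whole group of four.
    let a_rem = a_chunks.remainder();
    let b_rem = b_chunks.remainder();

    for (ca, cb) in a_chunks.zip(b_chunks) {
        for j in 0..4 {
            sums[j] += ca[j] * cb[j];
        }
    }

    // Final reduction of the four lanes, then the scalar tail.
    let mut total = (sums[0] + sums[1]) + (sums[2] + sums[3]);
    for (x, y) in a_rem.iter().zip(b_rem) {
        total += x * y;
    }
    total
}
```

The order of operations is fixed by the source, so the compiler does not need permission to reassociate anything. The result can differ in the last bits from a plain left-to-right sum, because the grouping is different.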

Generally, I would recommend against -ffast-math mostly because it enables -ffinite-math-only and that one can really blow up in your face. Most other flags (like -funsafe-math-optimizations) aren't that bad from an accuracy standpoint. Obviously you should not turn them on for code that you have actually tuned to minimize error, but in other cases they barely ever degrade the results.

>>nemoth+hi
> you should have the know how to handroll a couple lines of ASM

For what architecture? What if this code is in a library that your users might want to run on Intel (both 32 and 64 bit), ARM, Risc V and s390x? Even if you learn assembly for all of these, how are you going to get access to an S390X IBM mainframe to test your code? What if a new architecture[1] gets popular in the next couple of years, and you won't have access to a CPU to test on?

Leaving this work to a compiler or architecture-independent functions / macros that use intrinsics under the hood frees you from having to think about all of that. As long as whatever the user is running on has decent compiler support, your code is going to work and be fast, even years later.

[1] https://en.wikipedia.org/wiki/Loongson

>>cbm-vi+Cn
Loops on floats are fine, it's just reduction operations that hit the issue with the associativity assumption for floats leading to UB. You can trick it by making f32x16 types like the wide crate does or, if you use nightly, simba can do it with const generic expressions.

>>nemoth+hi
You can write it like this to get the compiler to generate SIMD: https://godbolt.org/z/ohvoEb7er

It's certainly not perfect though (in particular the final reduction/remainder handling).

Unfortunately Rust doesn't have a proper optimizing float type. I really wish there was a type FastF32 or something similar which may be optimized using the usual transformation rules of algebra (e.g. associative property, distributive property, x + y - y = x, etc).

There is fadd_fast and co, but those are UB on NaN/infinite input.

>>menaer+St
They do reorder instructions. I think the SIMD part has more to do with loop analysis than ILP.

It's quite telling that there is a #pragma omp simd to hint to a compiler to rewrite the loop.

Now I wonder what the state of polyhedral compilers is. It's been many years. And given the AI/LLM hype, they could really shine.

replies(1): >>menaer+rr1

>>sevagh+tn
For a very long time `-funsafe-math-optimizations` (and thus `-ffast-math`) had been infectious [1], so a responsible library should never have used `-ffast-math` anyway.

You are right in that the final binary is free to turn `-ffast-math` on if you can verify that everything went okay. But almost no one would actually verify that. It's like the advice that you shouldn't write your own crypto code---it's fine if you know what you are doing, but almost no one does, so the advice is technically false but still worthwhile.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 (GCC), https://github.com/llvm/llvm-project/issues/57589 (LLVM)

>>camel-+il
That or a scoped optimization directive. GCC does allow `__attribute__((optimize("-ffast-math")))` as a function-wide attribute, but Clang doesn't seem to have an equivalent and the standard syntax `[[gcc::optimize("-ffast-math")]]` doesn't seem to work either. In any case, such optimization should be visible from the code in my opinion.

replies(1): >>lorenz+hf1

>>mratsi+q3
You don't necessarily need `-ffast-math` if you are willing to tweak your code a bit, since you can just introduce accumulators by yourself. Such optimizer-friendly code is not hard to write if you know basic principles.

replies(1): >>mratsi+3j1

>>sevagh+tn
That sounds like trying to run a program to check if it has undefined behavior. How do you make a test that's comprehensive and future-compiler-safe?

>>lifthr+1U
The problem is that it takes a single piece of code compiled with -ffast-math to break everything. It's simply not worth it.

replies(1): >>maskli+rh1

>>lorenz+hf1
GP seems to be saying that you can flag individual functions in GCC, thereby avoiding this issue: only flagged functions would be compiled with fast math semantics.

replies(1): >>lorenz+yi1

>>maskli+rh1
This only works if it's a leaf function that will throw away the result. If you feed the result of your -ffast-math function into other working code you risk breaking it.

replies(1): >>lifthr+Wm1

>>lifthr+eW
That's what I'm saying though. Either you do the accumulators yourself, or you need a compiler escape hatch.

replies(1): >>lifthr+fm1

>>mratsi+3j1
I mean, you don't need SIMD to put accumulators.

replies(1): >>mratsi+QC4

>>lorenz+yi1
`-ffast-math` is fully local, aside from GCC's unexpected `crtfastmath.o` linkage which is global.

Functions with `-ffast-math` enabled still return fp values via usual registers and in usual formats. If some function `f` is expected to return -1.0 to 1.0 for particular inputs, `-ffast-math` can only make it return 1.001 or NaN instead. If another function without `-ffast-math` expects and doesn't verify f's return value, it will surely misbehave, but only because the original analysis of f no longer holds.

`-ffast-math` the compiler option is bad because this effect is not evident from the code. Anything visible in the code should be okay.

>>mratsi+iG
> I think the SIMD part has more to do with loop analysis than ILP.

If you know how to rewrite the algorithm in such a way that it makes close-to-ideal utilization of CPU ports through your SIMD then it is practically impossible to beat it. And I haven't seen a compiler (GCC, clang) doing such a thing, or at least not in the instances I had written. I've measured substantial improvements from such and similar utilization of CPU-level microarchitectural details. So I don't think it's the loop analysis only; I think it's practically an impossible task for the compiler. Perhaps with the AI ...

>>stevek+za
I had a go at writing this, both the initial float version and the later integer-based version. Note that I haven't actually tested these to make sure that output is correct, so, I might have messed up somewhere, and that I targeted Rocketlake for the CPU.

For the float-based version[0] I had to break out the unstable portable_simd to get it to vectorize. Most of the function ends up being setting up everything, but then actually doing the calculation is simple, and basically the same as the non-SIMD section. I've never used the portable SIMD stuff before, and it was quite pleasant to use.

For the integer-based version, I started with the simple naive approach[1], and that vectorized to a pretty good degree on stable. However, it doesn't use the dot-product instruction. For that, I think we need to use nightly and go a bit manual[2]. Unsurprisingly, it mostly ends up looking like the float version as a fair chunk is just setup. I didn't bother here, but it should probably be using feature detection to make sure the instruction exists.

[0] https://godbolt.org/z/Gdv8azorW [1] https://godbolt.org/z/d8jv3ofYo [2] https://godbolt.org/z/4oYEnKTbf

>>lifthr+fm1
The compiler escape hatch is -ffast-math