What is the correct idiom here? It feels like if this sort of thing really matters to you, you should have the know-how to hand-roll a couple of lines of ASM. I want to say this is rare, but I had a project a couple of years ago where I needed to hand-roll some vectorized instructions on a Raspberry Pi.
I personally almost always use -ffast-math by default in my C programs that care about performance, because I almost never care enough about the loss in accuracy. The only case I remember it biting me was when doing some random-number distribution tests where I cared about subnormals, and I got confused for a second because they didn't seem to exist (-ffast-math disables them on x86).
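To make that concrete, here is a minimal sketch (my own toy example, not from the comment above). On x86, linking a program whose main file was built with -ffast-math pulls in GCC's crtfastmath startup code, which sets the FTZ/DAZ bits in the SSE control register, so subnormal results are flushed to zero:

```c
/* subnormal_demo.c -- hypothetical demo.
   gcc -O2 subnormal_demo.c && ./a.out              -> prints ~5.56268e-309
   gcc -O2 -ffast-math subnormal_demo.c && ./a.out  -> typically prints 0 on x86,
   because the fast-math startup code sets FTZ/DAZ and the subnormal
   result of the division is flushed to zero. */
#include <stdio.h>
#include <float.h>

int main(void) {
    volatile double smallest_normal = DBL_MIN;  /* volatile: keep the division at run time */
    double subnormal = smallest_normal / 4.0;   /* mathematically a subnormal value */
    printf("%g\n", subnormal);
    return 0;
}
```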
FWIW I have oodles of numerical C++ code where fast-math doesn't change the output.
I suppose -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math will get you most of the way there, and maybe an -ffinite-math-only when appropriate.
Generally, I would recommend against -ffast-math mostly because it enables -ffinite-math-only and that one can really blow up in your face. Most other flags (like -funsafe-math-optimizations) aren't that bad from an accuracy standpoint. Obviously you should not turn them on for code that you have actually tuned to minimize error, but in other cases they barely ever degrade the results.
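To illustrate the "blow up in your face" part, a hedged sketch of the classic -ffinite-math-only failure mode (the exact folding behavior varies by compiler and version): the compiler is allowed to assume NaNs never occur, so a NaN guard can be optimized out entirely.

```c
/* Sketch: under -ffinite-math-only (implied by -ffast-math) the compiler
   may assume x is never NaN and fold the check below to "false", so the
   guard silently disappears. Behavior varies by compiler and version. */
#include <math.h>
#include <stdio.h>

double safe_reciprocal(double x) {
    if (isnan(x))          /* may be compiled away under -ffinite-math-only */
        return 0.0;
    return 1.0 / x;
}

int main(void) {
    volatile double zero = 0.0;
    /* zero / zero is NaN at run time; without fast-math this prints 0.000000,
       with -ffast-math it may print nan because the guard was removed. */
    printf("%f\n", safe_reciprocal(zero / zero));
    return 0;
}
```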
For what architecture? What if this code is in a library that your users might want to run on Intel (both 32- and 64-bit), ARM, RISC-V, and s390x? Even if you learn assembly for all of these, how are you going to get access to an s390x IBM mainframe to test your code? What if a new architecture[1] gets popular in the next couple of years, and you won't have access to a CPU to test on?
Leaving this work to a compiler or architecture-independent functions / macros that use intrinsics under the hood frees you from having to think about all of that. As long as whatever the user is running on has decent compiler support, your code is going to work and be fast, even years later.
It's certainly not perfect though (in particular the final reduction/remainder handling).
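For context, a small sketch of the kind of loop being discussed (my example, not the parent's code): a float sum is a serial dependency chain under strict IEEE semantics, so the compiler will only vectorize it once it is allowed to reassociate additions (the flag subset listed upthread, or plain -ffast-math). It then typically keeps several SIMD partial sums, does a horizontal reduction at the end, and falls back to a scalar loop for the leftover elements, which is the part described above as imperfect.

```c
/* Sketch: with plain -O2 this stays scalar because IEEE addition is not
   associative; once reassociation is allowed the compiler can keep several
   partial sums in SIMD registers, combine them at the end, and handle the
   remaining tail elements in a scalar loop. */
#include <stddef.h>

float sum(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];    /* serial chain in source order */
    return s;
}
```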
Unfortunately Rust doesn't have a proper optimizing float type. I really wish there were a type like `FastF32` that could be optimized using the usual transformation rules of algebra (e.g. associativity, distributivity, x + y - y = x, etc.).
There is `fadd_fast` and co, but those are UB on NaN/infinite inputs.
You are right in that the final binary is free to turn `-ffast-math` on if you can verify that everything went okay. But almost no one would actually verify that. It's like the advice that you shouldn't write your own crypto code: it's fine if you know what you're doing, but almost no one does, so the advice is technically false but still worthwhile.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 (GCC), https://github.com/llvm/llvm-project/issues/57589 (LLVM)
Functions compiled with `-ffast-math` still return floating-point values via the usual registers and in the usual formats. If some function `f` is expected to return values between -1.0 and 1.0 for particular inputs, `-ffast-math` can at most make it return, say, 1.001 or NaN instead. If another function compiled without `-ffast-math` expects that range and doesn't verify `f`'s return value, it will surely misbehave, but only because the original analysis of `f` no longer holds.
`-ffast-math` as a compiler option is bad because this effect is not evident from the code. Anything visible in the code itself should be okay.
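A hedged sketch of that failure mode (file and function names are hypothetical, just to make the mechanism concrete). The ABI is unchanged and the value still comes back in the usual register; only the range guarantee the caller relied on is gone:

```c
/* caller.c -- compiled WITHOUT -ffast-math. fast_sin() is a hypothetical
   function defined in another translation unit that WAS built with
   -ffast-math and is documented to return values in [-1.0, 1.0]. */
#include <math.h>

double fast_sin(double x);        /* may overshoot to e.g. 1.0000001 under fast-math */

double cos_from_sin(double x) {
    double s = fast_sin(x);       /* this TU's analysis assumed |s| <= 1 */
    return sqrt(1.0 - s * s);     /* negative argument -> NaN, even though this
                                     file follows strict IEEE semantics */
}
```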