Compilers are pretty skittish about changing the order of floating-point operations (for good reason), and -ffast-math is the thing that lets them transform expressions to try to generate faster code.
I.e., instead of doing "n / 10", doing "n * 0.1". The issue, of course, is that things like 0.1 can't be perfectly represented with floats, but 100 / 10 can be. So now you've introduced a tiny bit of error where it might not have existed.
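To make that concrete, here's the transformation done by hand (whether the compiler actually does it depends on -freciprocal-math, which -ffast-math turns on):

    #include <cstdio>

    int main() {
        double x = 3.0;
        double exact = x / 10.0;          // division is correctly rounded
        double fast  = x * (1.0 / 10.0);  // what reciprocal-math may emit instead:
                                          // multiply by the (already rounded) reciprocal
        std::printf("%.17g\n%.17g\n", exact, fast);
        // 0.29999999999999999 vs 0.30000000000000004 -- one ulp apart
    }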
I deal with a lot of floating point professionally day to day, and I use fast math all the time, since the tradeoff of a relatively small loss of accuracy for higher performance is acceptable. Maybe the biggest issue I run into is the lack of denorms with CUDA fast-math, and it’s pretty rare for me to care about numbers smaller than 10^-38. Heck, I’d say I can tolerate 8 or 16 bits of mantissa most of the time, and fast-math floats are way more accurate than that. And we know a lot of neural network training these days can tolerate less than 8 bits of mantissa.
I would think practically all modern FPU code on x86-64 would be using the SIMD registers which have explicit widths.
'slightly'? Last I checked, -Ofast completely breaks std::isnan and std::isinf--they always return false.
We are fortunately starting to see newer (well, not that new now) CPU instructions like FMA that let the more accurate way of computing things not take such huge performance hits.
Can you elaborate? How can fast-math sneak into a library that disabled fast-math at compile time?
> fast-math enables some dynamic initialization of the library that changes the floating point environment in some ways.
I wasn’t aware of this. I would love to see some documentation discussing exactly what happens; can you send a link?
But you're correct that it's probably usually fine in practice.
A lot of library code is in headers (especially in C++!). The code in headers is compiled by your compiler using your compile options.
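For instance (a made-up header, just to show the mechanism), the body below gets compiled into *your* translation unit with *your* flags, regardless of how the library itself was built:

    // hypothetical mylib/clamp.h
    #pragma once
    #include <cmath>

    inline double clamp_or_default(double x, double lo, double hi) {
        if (std::isnan(x)) return lo;   // with your -ffinite-math-only, this
                                        // check can be optimized away
        return x < lo ? lo : (x > hi ? hi : x);
    }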
Which compiler are you using where std::isinf breaks? Hopefully it was also clear that my experience leans toward CUDA, and I think the inf & nan support works there in the presence of NVCC’s fast-math.
But yeah, it's probably a good idea to not put code which breaks under -ffast-math in headers if possible.
* It links in an object file that enables denormal flushing globally, so that it affects all libraries linked into your application, even if said library explicitly doesn't want fast-math. This is seriously one of the most user-hostile things a compiler can do.
* The results of your program will vary depending on the exact make of your compiler and other random attributes of your compile environment, which can wreak havoc if you have code that absolutely wants bit-identical results. This doesn't matter for everybody, but there are some domains where this can be a non-starter (e.g., multiplayer game code).
* Fast-math precludes you from using NaN or infinities, and often even from defensively testing for them. Sure, there are times where this is useful, but the option you'd rather suggest to an uninformed programmer would be a "floating-point code can't overflow" option, not "infinity doesn't exist and it's UB if it does exist".
* Fast-math can cause hard range guarantees to fail. Maybe you've got code that you can prove that, even with rounding error, the result will still be >= 0. With fast-math, the code might be adjusted so that the result is instead, say, -1e-10. And if you pass that to a function with a hard domain error at 0 (like sqrt), you now go from the result being 0 to the result being NaN. And see above about what happens when you get NaN.
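A tiny sketch of that last failure mode (not tied to any particular compiler transformation): if a value can land a hair below zero, clamp it before the domain edge, or expect NaN downstream.

    #include <cmath>
    #include <cstdio>

    double safe_sqrt(double d) {
        // Defensive clamp: under fast-math (or plain rounding error) d can
        // come out as something like -1e-10 even when the math says d >= 0.
        return std::sqrt(d < 0.0 ? 0.0 : d);
    }

    int main() {
        std::printf("%g vs %g\n", safe_sqrt(-1e-10), std::sqrt(-1e-10));  // 0 vs nan
    }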
Fast-math is a tradeoff, and if you're willing to accept the tradeoff it offers, it's a fine option to use. But most programmers don't even know what the tradeoffs are, and the failure mode can be absolutely catastrophic. It's definitely an option that is in the "you must be this knowledgeable to use" camp.
Hey, I grant and acknowledge that using fast-math carries a little risk of surprises; we don’t necessarily need to go hunting for corner cases. I’m mostly pushing back a little because using floats at all carries almost as much risk. A lot of people seem to use floats without knowing how inaccurate floats are, and a lot of people aren’t doing precision analysis or handling the exceptional cases… and don’t really need to.
Turn on fast-math and it flips the FTZ/DAZ bits for the entire application. Even if you turned it on for just a shared library!
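You can even watch it happen from inside the process; on x86 that state lives in the MXCSR register, and the standard SSE intrinsics expose it (this is what GCC's crtfastmath.o startup object flips when anything linked in was built with -ffast-math):

    #include <xmmintrin.h>   // _MM_GET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_GET_DENORMALS_ZERO_MODE
    #include <cstdio>

    int main() {
        std::printf("FTZ: %s\n",
            _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON ? "on" : "off");
        std::printf("DAZ: %s\n",
            _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON ? "on" : "off");
    }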
Actually, no, the x87 FPU instructions are the only ones that won't be affected.
It sets the FTZ/DAZ bits, which exist for SSE instructions but not x87 instructions.
fast-math is one of the dumbest things we have as an industry IMO.
Even if I don't care about the accuracy differences, I still need a way to check for invalid input data. The upshot is that I had to roll my own isnan and isinf to be able to use -Ofast (because it's actually the underlying __builtin_xxx intrinsics that are broken), which still seems wrong to me.
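For the curious, what I mean by rolling my own is roughly this kind of thing (a sketch, not my exact code): classify by bit pattern, which doesn't rely on the float comparisons that -ffinite-math-only assumes away.

    #include <cstdint>
    #include <cstring>

    bool my_isnan(double x) {
        std::uint64_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        std::uint64_t exp  = (bits >> 52) & 0x7FF;        // 11-bit exponent
        std::uint64_t frac = bits & 0xFFFFFFFFFFFFFull;   // 52-bit mantissa
        return exp == 0x7FF && frac != 0;                 // all-ones exponent, nonzero mantissa
    }

    bool my_isinf(double x) {
        std::uint64_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        return (bits & 0x7FFFFFFFFFFFFFFFull) == 0x7FF0000000000000ull;  // +/- infinity
    }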
> Fast-math can cause hard range guarantees to fail. Maybe you’ve got code that you can prove that, even with rounding error, the result will still be >= 0.
Floats do this too, it’s pretty routine to bump into epsilon out-of-range issues without fast-math. Most people don’t prove things about their rounding error, and if they do, it’s easy for them to account for 3 ULPs of fast-math error compared to 1/2 ULP for the more accurate operations. Like, nobody who knows what they’re doing will call sqrt() on a number that is fresh out of a multiplier and might be anywhere near zero without testing for zero explicitly, right? I’m sure someone has done it, but I’ve never seen it, and it ranks high on the list of bad ideas even if you steer completely clear of fast-math, no?
I guess I just wanted to resist the unspecific parts of the FUD just a little bit. I like your list a lot because it’s specific. Fast-math does carry some additional risks for accuracy sensitive code, and clearly as you and others showed, can infect and impact your whole app, and it can sometimes lead to situations where things break that wouldn’t have happened otherwise. But I think in the grand scheme these situations are quite rare compared to how often people mess up regular floating point math. For a very wide swath of people doing casual arithmetic, fast-math is not likely to cause more problems than floats cause, but it’s fair to want to be careful and pay attention.
And yet, for audio processing, this is an option that most DAWs either enable silently or offer as a user choice, because denormals are inevitable in reverb tails, and on most Intel processors they slow things down by orders of magnitude.
Really it'll be the SIMD-style instructions that speed things up.
Small nit, but floats aren't inaccurate; they have non-uniform precision. Some float operations can be inaccurate, but that's rather path-dependent...
One problem with -ffast-math is that a) it sounds appealing and b) people don't understand floats, so lots of people turn it on without understanding what it does, and that can introduce subtle problems in code they didn't write.
Sometimes in computational code it makes sense e.g. to get rid of denorms, but a very small fraction of programmers understand this properly, or ever will.
I wish they had named it something scary sounding.
People who deal with actual numerical computing know that the statement "fast math is only slightly less accurate" is absurd. Fast math is unbounded in its inaccuracy! It can reorder your computations so that something that used to sum to 1 now sums to 0, it can cause catastrophic cancellation, etc.
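To make the reordering point concrete, here are both groupings written out by hand; -fassociative-math (part of -ffast-math) is what licenses the compiler to swap one for the other:

    #include <cstdio>

    int main() {
        double a = 1.0, b = 1e20;
        std::printf("%g\n", (a + b) - b);  // 0: the 1.0 is absorbed into 1e20 and lost
        std::printf("%g\n", a + (b - b));  // 1: the big terms cancel first
    }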
Please stop giving people terrible advice on a topic you're totally unfamiliar with.
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.
This already shouldn't be assumed, because even the same code, compiler, and flags can produce different floating point results on different CPU targets. With the world increasingly split over x86_64 and aarch64, with more to come, it would be unwise to assume they produce the same exact numbers.
Often this comes down to acceptable implementation-defined behavior, e.g. temporarily using an 80-bit floating-point register despite the result being coerced to 64 bits, or using an FMA instruction that loses less precision than separate multiply and add instructions.
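The FMA case is easy to see by spelling out both codegen choices explicitly (std::fma forces the fused form regardless of contraction flags):

    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 1e8 + 1.0;                    // a*a is not exactly representable
        double p = a * a;                        // rounded product
        std::printf("%g\n", p - p);              // separate multiply then subtract: exactly 0
        std::printf("%g\n", std::fma(a, a, -p)); // fused: the rounding error of p, here 1
    }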
Portable results should come from integers (even if used to simulate rationals and fixed point), not floats. I understand that's not easy with multiplayer games, but doing so with floats is simply impossible because of what is left as implementation-defined in our language standards.
I believe this is "always" rather than often when it comes to the actual operations defined by the FP standard. gcc does play it fast and loose (not via -ffast-math, which isn't enabled by default, but via FMA contraction, which is), but this is technically illegal and can at least be easily configured back into standards-compliant mode.
I think the bigger problem comes from what is _not_ documented by the standard. E.g. transcendental functions. A program calling plain old sqrt(x) can find itself behaving differently _even between different steppings of the same core_, not to mention that there are well-known differences between AMD vs Intel. This is all using the same binary.
All CPU hardware nowadays conforms to IEEE 754 semantics for binary32 and binary64. (I think all the GPUs now have non-denormal-flushing modes, but my GPU knowledge is less deep). All compilers will have a floating-point mode that preserves IEEE 754 semantics assuming that FP exceptions are unobservable and rounding mode is the default, and this is usually the default (icc/icx is unusual in making fast-math the default).
Thus, you have portability of floating-point semantics, subject to caveats:
* The math library functions [1] are not the same between different implementations. If you want portability, you need to ensure that you're using the exact same math library on all platforms.
* NaN payloads are not consistent on different platforms, or necessarily within the same platform due to compiler optimizations. Note that not even IEEE 754 attempts to guarantee NaN payload stability.
* Long double is not the same type on different platforms. Don't use it. Seriously, don't.
* 32-bit x86 support for exact IEEE 754 equivalence is essentially a "known-WONTFIX" bug. (This is why the C standard introduced FLT_EVAL_METHOD). The x87 FPU evaluates everything in 80-bit precision, and while you can make this work for binary32 easily (double rounding isn't an issue), albeit with some performance cost (the solution involves reading/writing memory after every operation), it's not so easy for binary64. However, the SSE registers do implement IEEE 754 exactly, and are present on every chip old enough to drink, so it's not really a problem anymore. There's a subsidiary issue that the x86-32 ABI requires floats be returned in x87 registers, which means you can't properly return an sNaN correctly, but sNaN and floating-point exceptions are firmly in the realm of nonportability anyways.
In short, if you don't need to care about 32-bit x86 support (or if you do care but can require SSE2 support), and you don't care about NaNs, and you bring your own libraries along, you can absolutely expect to have floating-point portability.
[1] It's actually not even all math library functions, just those that are like sin, pow, exp, etc., but specifically excluding things like sqrt. I'm still trying to come up with a good term to encompass these.
Transcendental functions. For those, computing a correctly rounded result can be infeasible to guarantee for some inputs (the table-maker's dilemma: https://en.wikipedia.org/wiki/Table-maker%27s_dilemma), so standards for numerical compute punt on the issue and allow for some error in the last digit.
(The main defining factor is if they're an IEEE 754 §5 operation or not, but IEEE 754 isn't a freely-available standard.)
Yes, and it could very well be that the correct answer is actually 0 and not 1.
Unless you write your code to explicitly account for fp associativity effects, in which case you don't need generic forum advice about fast-math.
No, it's not. gcc itself still defaults to fp-contract=fast. Or at least does in all versions I have ever tried.
It’s easy to have wrong sums and catastrophic cancellation without fast math, and it’s relatively rare for fast math to cause those issues when an underlying issue didn’t already exist.
I’ve been working in some code that does a couple of quadratic solves and has high order intermediate terms, and I’ve tried using Kahan’s algorithm repeatedly to improve the precision of the discriminants, but it has never helped at all. On the other hand I’ve used a few other tricks that improve the precision enough that the fast math version is higher precision than the naive one without fast math. I get to have my cake and eat it too.
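(For context, the usual fma-based version of that discriminant trick looks roughly like this; a sketch, not necessarily exactly what I ran:)

    #include <cmath>

    // b*b - a*c with the rounding errors of both products recovered via fma.
    double discriminant(double a, double b, double c) {
        double p  = b * b;
        double q  = a * c;
        double dp = std::fma(b, b, -p);   // exact rounding error of p
        double dq = std::fma(a, c, -q);   // exact rounding error of q
        return (p - q) + (dp - dq);
    }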
Fast math is a tradeoff. Of course it’s a good idea to know what it does and what the risks of using it are, but at least in terms of the accuracy of fast math in CUDA, it’s not an opinion whether the accuracy is relatively close to slow math, it’s reasonably well documented. You can see for yourself that most fast math ops are in the single digit ulps of rounding error. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
My DAW uses both "denormals are zero" and "flush denormals to zero" to try to avoid them; it also offers a "DC Bias" option where extremely small values are added to samples to avoid denormals.
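The DC bias trick amounts to something like this (buffer name and offset value are illustrative, not my DAW's actual numbers): keep feedback paths from decaying down into the denormal range by mixing in a constant that is inaudible but still comfortably normal.

    #include <cstddef>

    void add_dc_bias(float* samples, std::size_t n) {
        const float kBias = 1e-20f;   // well above FLT_MIN (~1.2e-38), far below audibility
        for (std::size_t i = 0; i < n; ++i)
            samples[i] += kBias;
    }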
Unless of course we are talking about the 80 bit format.
If that's not the case, would be interested to know where they differ.
Unfortunately, for the transcendental functions the accuracy still hasn't been pinned down, since it remains an ongoing research problem.
There have been some great strides in figuring out the worst cases for binary floating point up to doubles, so hopefully an upcoming standard will stipulate 0.5 ULP for transcendentals. But decimal floating point still has a long way to go.
The slowdown on Intel platforms has always frustrated me, because denorms provide nice smoothing around 0.
At the same time it was nice only having to consider normal floating point when trying to get more accuracy out of calculations, etc.
Every 754 architecture (including SSE) I've worked on has an accurate sqrt().
I'm assuming you're talking about with "fast math" enabled? In which case all bets are off anyway!
Now, there is also often an approximate rsqrt and approximate reciprocal, with varying degrees of accuracy, and that can be "fun."
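On SSE the "fun" usually looks like this: rsqrtps is only good to roughly 12 bits, so it gets one Newton-Raphson step bolted on when accuracy matters.

    #include <xmmintrin.h>

    __m128 fast_rsqrt(__m128 x) {
        __m128 y = _mm_rsqrt_ps(x);          // ~12-bit estimate
        __m128 half  = _mm_set1_ps(0.5f);
        __m128 three = _mm_set1_ps(3.0f);
        // y = 0.5 * y * (3 - x*y*y): one Newton-Raphson refinement, ~23-bit result
        return _mm_mul_ps(_mm_mul_ps(half, y),
                          _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
    }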
There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.
[1] https://inria.hal.science/inria-00072594/document
> There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.
Thank you for the pointer. These were written by the usual folks you'd expect from such papers (e.g. Paul Zimmermann), so I believe they did achieve a significant improvement. Unfortunately it is still not complete; the paper notes that the third and final phase may still fail, but it is unknown whether that case actually occurs. So we will have to wait...
Is this out of date?
https://developer.arm.com/documentation/den0018/a/NEON-Instr...
"Some times" here being almost all the time. It is rare that your code will break without denormals if it doesn't already have precision problems with them.
Or maybe the library you use...
FMAs were difficult. The Visual Studio compiler in particular didn't support writing deliberate FMAs for SSE code, so you had to rely on the compiler to recognise and replace multiply-adds. Generally I want FMAs because they're more accurate, but I want to control where they go.
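One way to keep that control today (assuming an FMA-capable target and a toolchain that exposes the intrinsics; names below are just illustrative) is to spell the FMAs out rather than leaving them to contraction:

    #include <cmath>
    #include <immintrin.h>

    // Scalar: std::fma rounds once per call, independent of contraction settings.
    double dot3(const double* a, const double* b) {
        return std::fma(a[2], b[2], std::fma(a[1], b[1], a[0] * b[0]));
    }

    // Vector: the FMA3 intrinsic (needs an FMA-capable target, e.g. -mfma or /arch:AVX2).
    __m256d madd(__m256d a, __m256d b, __m256d c) {
        return _mm256_fmadd_pd(a, b, c);
    }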