https://gcc.godbolt.org/z/fGaP6roe9
I see the same behavior on clang 17 as well
Using this microbenchmark on an Intel Sapphire Rapids CPU, compiling with -march=k8 to get the older form takes ~980ns, while compiling with -march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?
Never did figure out what gcc's problem was.
This specific test (click the godbolt links) does not reproduce the issue.
> Trust, but verify (Russian: доверяй, но проверяй, tr. doveryay, no proveryay, IPA: [dəvʲɪˈrʲæj no prəvʲɪˈrʲæj]) is a Russian proverb, which is rhyming in Russian. The phrase became internationally known in English after Suzanne Massie, a scholar of Russian history, taught it to Ronald Reagan, then president of the United States, the latter of whom used it on several occasions in the context of nuclear disarmament discussions with the Soviet Union.
Memorably referenced in "Chernobyl": https://youtu.be/9Ebah_QdBnI?t=79
std::min(max, std::max(min, v));

    maxsd xmm0, xmm1
    minsd xmm0, xmm2

std::min(std::max(v, min), max);

    maxsd xmm1, xmm0
    minsd xmm2, xmm1
    movapd xmm0, xmm2
For min/max on x86, if either operand is NaN the instruction copies the second operand into the destination. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value). The reason for this NaN behavior is that minsd is specified to behave like `(a < b) ? a : b`: if either a or b is NaN the comparison is false, and the expression evaluates to b.
Possibly std::clamp has the comparisons ordered like the second case?
Turn on fast-math and it flips the FTZ/DAZ bits for the entire application. Even if you turned it on for just a shared library!
Transcendental functions. For those, computing an exactly rounded result might be unfeasible for some inputs (the table-maker's dilemma: https://en.wikipedia.org/wiki/Table-maker%27s_dilemma). So standards for numerical compute punt on the issue and allow for some error in the last digit.
It’s easy to have wrong sums and catastrophic cancellation without fast math, and it’s relatively rare for fast math to cause those issues when an underlying issue didn’t already exist.
I’ve been working in some code that does a couple of quadratic solves and has high order intermediate terms, and I’ve tried using Kahan’s algorithm repeatedly to improve the precision of the discriminants, but it has never helped at all. On the other hand I’ve used a few other tricks that improve the precision enough that the fast math version is higher precision than the naive one without fast math. I get to have my cake and eat it too.
Fast math is a tradeoff. Of course it’s a good idea to know what it does and what the risks of using it are, but at least in terms of the accuracy of fast math in CUDA, it’s not an opinion whether the accuracy is relatively close to slow math, it’s reasonably well documented. You can see for yourself that most fast math ops are in the single digit ulps of rounding error. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
Actually I think @IainIreland misremembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW
TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.
There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.
[1] https://inria.hal.science/inria-00072594/document
> There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.
Thank you for the pointer. These were written by usual folks you'd expect from such papers (e.g. Paul Zimmermann) so I believe they did achieve significant improvement. Unfortunately it is still not complete, the paper notes that the third and final phase may still fail but is unknown whether it indeed occurs or not. So we will have to wait...
So, yes when targeting VFP math. NEON already always works in this mode though.
Is this out of date?
https://developer.arm.com/documentation/den0018/a/NEON-Instr...
The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand". This is apparent when min is greater than max as the results differ in the choice of the boundaries.
Funny how quickly the discussion around such simple functions becomes difficult/interesting.
Some toying around with the various implementations in C [1]:
So when you have an algorithm like clamp which requires v to be "preserved" throughout the computation, you can't overwrite xmm0 with the first instruction; basically you need to "save" and "restore" it, which costs an extra instruction (the movapd above).
I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn