zlacker

std::clamp generates less efficient assembly than std::min(max, std::max(min, v))

submitted by x1f604+(OP) on 2024-01-16 10:53:15 | 174 points 142 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only
2. tambre+2i[view] [source] 2024-01-16 13:21:16
>>x1f604+(OP)
Both recent GCC and Clang can generate the optimal version of std::clamp() if you add something like -march=znver1, even at -O1 [0]. Interesting!

[0] https://godbolt.org/z/YsMMo7Kjz

9. celega+im[view] [source] 2024-01-16 13:50:05
>>x1f604+(OP)
On gcc 13, the difference in assembly between the min(max()) version and std::clamp is eliminated when I add the -ffast-math flag. I suspect that the two implementations handle one of the arguments being NaN a bit differently.

https://gcc.godbolt.org/z/fGaP6roe9

I see the same behavior on clang 17 as well

https://gcc.godbolt.org/z/6jvnoxWhb
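
For a concrete feel of the NaN difference (a minimal sketch, not the code from the godbolt links): std::clamp compares with operator< and hands back v when v is NaN, while the min/max composition can absorb the NaN into a boundary. That is exactly the kind of distinction -ffast-math tells the compiler it may ignore.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main() {
        double v = std::nan(""), lo = 0.0, hi = 1.0;
        // std::clamp(NaN, lo, hi): both comparisons are false, so v (NaN) is returned
        std::printf("clamp:   %f\n", std::clamp(v, lo, hi));
        // std::max(lo, NaN) yields lo, so the composition returns 0.0 instead
        std::printf("min/max: %f\n", std::min(hi, std::max(lo, v)));
    }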

12. planed+6n[view] [source] 2024-01-16 13:55:09
>>x1f604+(OP)
On a somewhat similar note, don't use std::lerp if you don't need its strong guarantees around rounding (monotonicity among other things).

https://godbolt.org/z/hzrG3s6T4
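
To make the tradeoff concrete (a rough sketch, not the code behind the link): the naive form is just a subtract and a multiply-add, but it gives up the guarantees std::lerp makes about the endpoints and monotonicity.

    #include <cmath>

    // Cheap, but not guaranteed monotonic in t and may not return exactly b at t == 1
    double lerp_naive(double a, double b, double t) {
        return a + t * (b - a);
    }

    // C++20 std::lerp: exact at t == 0 and t == 1 and monotonic, per the standard's requirements
    double lerp_strict(double a, double b, double t) {
        return std::lerp(a, b, t);
    }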

◧◩
14. jeffbe+yo[view] [source] [discussion] 2024-01-16 14:06:15
>>fooker+Ah
FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4

Using this microbenchmark on an Intel Sapphire Rapids CPU, compiling with -march=k8 to get the older form takes ~980 ns, while compiling with -march=native gives ~570 ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
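
For reference, a quick-bench-style harness for this kind of comparison looks roughly like the following (illustrative only; the names and loop body are mine, not the linked benchmark):

    #include <algorithm>
    #include <vector>
    #include <benchmark/benchmark.h>

    static void BM_clamp(benchmark::State& state) {
        std::vector<double> data(4096, 1.5);
        for (auto _ : state) {
            for (double& v : data)
                v = std::clamp(v, 0.0, 1.0);
            benchmark::DoNotOptimize(data.data());
            benchmark::ClobberMemory();
        }
    }
    BENCHMARK(BM_clamp);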

16. camblo+3q[view] [source] 2024-01-16 14:17:19
>>x1f604+(OP)
I did a double take on this because I wrote a blog post about this topic a few months ago and came to a very different conclusion: the results were effectively identical on clang and gcc, which is just weird.

Then I realized that I was writing about compiling for ARM and this post is about x86. Which is extra weird! Why is the compiler better tuned for ARM than x86 in this case?

Never did figure out what gcc's problem was.

https://godbolt.org/z/Y75qnTGdr

26. nickys+nw[view] [source] 2024-01-16 14:51:21
>>x1f604+(OP)
https://bugs.llvm.org/show_bug.cgi?id=47271

This specific test (click the godbolt links) does not reproduce the issue.

◧◩◪◨
35. plucas+0z[view] [source] [discussion] 2024-01-16 15:05:29
>>PaulDa+Fu
https://en.wikipedia.org/wiki/Trust,_but_verify

> Trust, but verify (Russian: доверяй, но проверяй, tr. doveryay, no proveryay, IPA: [dəvʲɪˈrʲæj no prəvʲɪˈrʲæj]) is a Russian proverb, which is rhyming in Russian. The phrase became internationally known in English after Suzanne Massie, a scholar of Russian history, taught it to Ronald Reagan, then president of the United States, the latter of whom used it on several occasions in the context of nuclear disarmament discussions with the Soviet Union.

Memorably referenced in "Chernobyl": https://youtu.be/9Ebah_QdBnI?t=79

◧◩◪◨
36. protom+Mz[view] [source] [discussion] 2024-01-16 15:09:16
>>PaulDa+Fu
Russian here. We use that expression from time to time: https://ya.ru/search/?text="доверяй%2C+но+проверяй"
◧◩◪◨⬒
46. borodi+bH[view] [source] [discussion] 2024-01-16 15:41:34
>>pavlov+5D
So it was a bit more pervasive than this. The issue is that flushing subnormals (values very close to 0) to 0 is controlled by a CPU register that gets set, so if a library built with the fast-math flags gets loaded, it sets that register, causing the whole process to flush its subnormals. See e.g. https://github.com/llvm/llvm-project/issues/57589
48. cmovq+FH[view] [source] 2024-01-16 15:42:56
>>x1f604+(OP)
Depending on the order of the arguments to min/max, you'll get an extra move instruction [1]:

std::min(max, std::max(min, v));

        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2

std::min(std::max(v, min), max);

        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2

For min/max on x86, if any operand is NaN the instruction copies the second operand into the first. So the compiler can't reorder the second case to look like the first (to leave the result in xmm0 for the return value).

The reason for this NaN behavior is that minsd is implemented to look like `(a < b) ? a : b`, where if any of a or b is NaN the condition is false, and the expression evaluates to b.

Possibly std::clamp has the comparisons ordered like the second case?

[1]: https://godbolt.org/z/coes8Gdhz
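
A scalar model of those semantics (a sketch of the rule, not the actual instructions) shows why the two argument orders aren't interchangeable:

    // minsd/maxsd: ordered comparison, so any NaN makes it false and the second operand wins
    double minsd_model(double a, double b) { return (a < b) ? a : b; }
    double maxsd_model(double a, double b) { return (a > b) ? a : b; }

    // With v = NaN:
    //   minsd(hi, maxsd(lo, v))  ->  maxsd(lo, NaN) = NaN, minsd(hi, NaN) = NaN
    //   minsd(maxsd(v, lo), hi)  ->  maxsd(NaN, lo) = lo,  minsd(lo, hi)  = lo
    // Different results, so the compiler can't swap operands to keep the result in xmm0.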

◧◩◪◨⬒⬓⬔⧯
63. jcranm+LO[view] [source] [discussion] 2024-01-16 16:12:20
>>dahart+GI
https://github.com/llvm/llvm-project/issues/57589

Turn on fast-math and it flips the FTZ/DAZ bits for the entire application, even if you turned it on for just a shared library!
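
If you want to check whether something in your process has already flipped them, the MXCSR bits are readable with the SSE intrinsics (x86-only sketch):

    #include <cstdio>
    #include <xmmintrin.h>   // _MM_GET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_GET_DENORMALS_ZERO_MODE

    int main() {
        bool ftz = _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
        bool daz = _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
        // GCC links crtfastmath.o under -ffast-math, which sets both bits at startup;
        // historically that object also ended up inside -ffast-math shared libraries.
        std::printf("FTZ=%d DAZ=%d\n", ftz, daz);
    }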

◧◩
75. vitors+tZ[view] [source] [discussion] 2024-01-16 17:04:47
>>cmovq+FH
It seems this is the most likely reason. See also:

https://godbolt.org/z/q7e3MrE66

◧◩◪◨⬒⬓⬔
99. zozbot+9N1[view] [source] [discussion] 2024-01-16 20:33:37
>>jcranm+KL1
> It's actually not even all math library functions, just those that are like sin, pow, exp, etc., but specifically excluding things like sqrt. I'm still trying to come up with a good term to encompass these.

Transcendental functions. For those, computing an exactly rounded result might be infeasible for some inputs: https://en.wikipedia.org/wiki/Table-maker%27s_dilemma So standards for numerical computing punt on the issue and allow for some error in the last digit.

◧◩◪◨⬒⬓⬔
105. fl0ki+iQ1[view] [source] [discussion] 2024-01-16 20:50:22
>>jcranm+KL1
Not sure if this is a spooky coincidence, but I happened to be reading the Rust 1.75.0 release notes today and fell into this 50-tab rabbit hole: https://github.com/rust-lang/rust/pull/113053/
◧◩◪◨⬒
107. Maulin+ZS1[view] [source] [discussion] 2024-01-16 21:03:16
>>Sharli+4I
As 1 of ∞ examples of UB land, I once had to debug JS objects being misinterpreted as numbers when https://duktape.org/ was miscompiled with a fast-math equivalent (references to objects were encoded as NaNs.)
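
A stripped-down sketch of the pattern (not duktape's actual representation): the object/number distinction hinges on NaN checks that -ffast-math (via -ffinite-math-only) lets the compiler assume are always false.

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    constexpr std::uint64_t kQuietNaN = 0x7ff8000000000000ull;

    // Stash a (small) reference value in the payload bits of a quiet NaN
    double box_ref(std::uint64_t ref_bits) {
        std::uint64_t bits = kQuietNaN | ref_bits;
        double d;
        std::memcpy(&d, &bits, sizeof d);
        return d;
    }

    // The tag check: exactly the isnan() that fast-math may fold to false,
    // after which every boxed reference looks like an ordinary number
    bool is_object_ref(double v) { return std::isnan(v); }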
◧◩◪◨⬒
113. dahart+E72[view] [source] [discussion] 2024-01-16 22:25:44
>>light_+Rr1
I only do numeric computation; I don’t work in machine learning. Sorry, your assumptions are incorrect; maybe it’s best not to assume or attack. I didn’t exactly advise using fast math either; I asked for reasoning and pointed out that most casual uses of float aren’t highly sensitive to precision.

It’s easy to have wrong sums and catastrophic cancellation without fast math, and it’s relatively rare for fast math to cause those issues when an underlying issue didn’t already exist.

I’ve been working in some code that does a couple of quadratic solves and has high order intermediate terms, and I’ve tried using Kahan’s algorithm repeatedly to improve the precision of the discriminants, but it has never helped at all. On the other hand I’ve used a few other tricks that improve the precision enough that the fast math version is higher precision than the naive one without fast math. I get to have my cake and eat it too.
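
(For reference, the usual fma-based formulation of that discriminant trick looks roughly like this; a sketch of the textbook variant, not necessarily the exact code in question.)

    #include <cmath>

    // b*b - 4*a*c with the products' rounding error compensated via fma
    // (Kahan's 2x2-determinant trick applied to the discriminant)
    double discriminant_kahan(double a, double b, double c) {
        double w = 4.0 * a * c;               // 4*a is exact; one rounding in the *c
        double e = std::fma(-4.0 * a, c, w);  // w - 4*a*c, i.e. the rounding error, computed exactly
        double f = std::fma(b, b, -w);        // b*b - w with a single rounding
        return f + e;                         // ~ b*b - 4*a*c
    }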

Fast math is a tradeoff. Of course it’s a good idea to know what it does and what the risks of using it are, but at least in terms of the accuracy of fast math in CUDA, it’s not a matter of opinion whether the accuracy is relatively close to slow math; it’s reasonably well documented. You can see for yourself that most fast math ops are within single-digit ulps of rounding error. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

◧◩◪◨⬒
116. kybore+Bc2[view] [source] [discussion] 2024-01-16 22:55:57
>>fooker+E22
Not prediction, predication: https://en.wikipedia.org/wiki/Predication_(computer_architec...

By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
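
In scalar terms the idea is just this (a toy sketch, sticking with the clamp example from the article): write the selects so they map onto cmov/minsd-style predicated instructions instead of branches.

    double clamp_branchy(double v, double lo, double hi) {
        if (v < lo) return lo;     // conditional branches: fine if predictable,
        if (v > hi) return hi;     // costly when the data is noisy
        return v;
    }

    double clamp_branchless(double v, double lo, double hi) {
        v = (v < lo) ? lo : v;     // select, typically maxsd or cmov
        v = (v > hi) ? hi : v;     // select, typically minsd or cmov
        return v;
    }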

Actually, I think @IainIreland misremembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW

TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.

◧◩◪◨⬒⬓⬔⧯▣
125. jcranm+013[view] [source] [discussion] 2024-01-17 04:50:21
>>lifthr+EW2
My understanding is that we have exhaustively enumerated the unary binary32 functions and verified correct rounding for them. For binary64, exhaustive enumeration is not a viable strategy, but we generally have a decent idea of which cases end up being hard to round, and in a few cases we may have mechanical proofs of correctness.
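
Concretely, "exhaustively enumerated" for a unary binary32 function means something like the following (a sketch; real efforts compare against MPFR rather than a long double stand-in):

    #include <bit>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t mismatches = 0;
        for (std::uint64_t i = 0; i <= 0xFFFFFFFFull; ++i) {
            float x = std::bit_cast<float>(static_cast<std::uint32_t>(i));
            float got  = std::exp(x);                                   // implementation under test
            float want = static_cast<float>(std::exp((long double)x));  // higher-precision reference
            bool both_nan = std::isnan(got) && std::isnan(want);
            if (!both_nan &&
                std::bit_cast<std::uint32_t>(got) != std::bit_cast<std::uint32_t>(want))
                ++mismatches;
        }
        std::printf("mismatches: %llu\n", static_cast<unsigned long long>(mismatches));
    }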

There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.

◧◩◪◨⬒⬓⬔⧯▣▦
126. lifthr+s23[view] [source] [discussion] 2024-01-17 05:08:04
>>jcranm+013
For your information, binary64 has indeed been mapped exhaustively for several functions [1], so it is known that at most a triple-double representation is enough for correct rounding.

[1] https://inria.hal.science/inria-00072594/document

> There was a paper last year on binary64 pow (https://inria.hal.science/hal-04159652/document) which suggests that they have a correctly-rounded pow implementation, but I don't have enough technical knowledge to assess the validity of the claim.

Thank you for the pointer. This was written by the usual folks you'd expect for such papers (e.g. Paul Zimmermann), so I believe they did achieve a significant improvement. Unfortunately it is still not complete: the paper notes that the third and final phase may still fail, but it is unknown whether that case actually occurs. So we will have to wait...

◧◩◪◨⬒
129. accoun+fo3[view] [source] [discussion] 2024-01-17 08:24:37
>>Wander+l21
https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/...

So, yes when targeting VFP math. NEON already always works in this mode though.

◧◩◪◨⬒⬓⬔
130. accoun+ht3[view] [source] [discussion] 2024-01-17 09:01:52
>>jcranm+KL1
> All CPU hardware nowadays conforms to IEEE 754 semantics for binary32 and binary64.

Is this out of date?

https://developer.arm.com/documentation/den0018/a/NEON-Instr...

◧◩
131. lebubu+lt3[view] [source] [discussion] 2024-01-17 09:02:24
>>cmovq+FH
Yes, I arrived at the same conclusion.

The various code snippets in the article don't compute the same "function". The order between the min() and max() matters even when done "by hand". This is apparent when min is greater than max, as the results differ in which boundary gets chosen.
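
A minimal illustration (mine, not the linked snippets):

    #include <algorithm>
    #include <cstdio>

    int main() {
        double v = 5.0, lo = 3.0, hi = 1.0;                   // deliberately lo > hi
        std::printf("%f\n", std::min(hi, std::max(lo, v)));   // prints 1 (hi wins)
        std::printf("%f\n", std::max(lo, std::min(hi, v)));   // prints 3 (lo wins)
        // std::clamp(v, lo, hi) with lo > hi is undefined behavior per the standard
    }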

Funny how, for such simple functions, the discussion can so quickly become difficult/interesting.

Some toying around with the various implementations in C [1]:

[1]: https://godbolt.org/z/d4Tcdojx3

◧◩◪◨
132. ChrisR+Iv3[view] [source] [discussion] 2024-01-17 09:20:04
>>mhh__+EQ
Totally agreed. In Julia we use https://github.com/SciML/MuladdMacro.jl all over the place so that it's contextual and doesn't bleed into other functions. fast-math changing everything is just... dangerous.
◧◩
138. x1f604+Jd7[view] [source] [discussion] 2024-01-18 09:06:09
>>tambre+2i
Even with -march=znver1 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/WMKbeq5TY

◧◩
139. x1f604+Qd7[view] [source] [discussion] 2024-01-18 09:07:03
>>jeffbe+Cj
Even with -march=x86-64-v4 at -O3 the compiler still generates fewer lines of assembly for the incorrect clamp compared to the correct clamp for this "realistic" code:

https://godbolt.org/z/hd44KjMMn

◧◩◪
141. x1f604+ze7[view] [source] [discussion] 2024-01-18 09:13:41
>>Grumpy+Wk
I don't think it's a register allocation failure; it's actually necessitated by the ABI requirement (calling convention) that the first parameter be in xmm0 and the return value also be placed in xmm0.

So when you have an algorithm like clamp, which requires v to be "preserved" throughout the computation, you can't overwrite xmm0 with the first instruction; basically you need to "save" and "restore" it, which means an extra instruction.

I'm not sure why this causes the extra assembly to be generated in the "realistic" code example though. See https://godbolt.org/z/hd44KjMMn

[go to top]