zlacker

11 comments
1. fooker+(OP)[view] [source] 2024-01-16 13:18:31
If you benchmark these, you'll likely find the version with the jump edges out the one with the conditional instruction in practice.
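For anyone who wants to try this, here is a minimal sketch of the two shapes being compared. The names clamp_branchy and clamp_select are made up for illustration. Nothing guarantees that the compiler emits a jump for the first or a conditional instruction for the second; that depends on target, flags and surrounding code, so check the disassembly before drawing conclusions.

```cpp
#include <algorithm>

// Hypothetical name. Written with early returns, which a compiler may
// (but need not) lower to compare-and-jump sequences.
inline double clamp_branchy(double v, double lo, double hi) {
    if (v < lo) return lo;
    if (hi < v) return hi;
    return v;
}

// Hypothetical name. Written as a min/max composition, which a compiler may
// (but need not) lower to branch-free min/max or conditional-move instructions.
inline double clamp_select(double v, double lo, double hi) {
    return std::min(std::max(v, lo), hi);
}
```

Both return the same value for ordinary inputs with lo <= hi (NaN handling aside), so either one can be dropped into a benchmark loop.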
replies(3): >>svanta+T3 >>pclmul+X5 >>jeffbe+Y6
2. svanta+T3[view] [source] 2024-01-16 13:44:00
>>fooker+(OP)
That must depend on the platform and the surrounding code, no?
replies(1): >>fooker+Fg
3. pclmul+X5[view] [source] 2024-01-16 13:57:56
>>fooker+(OP)
Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (i.e., very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as 90/10.
replies(2): >>fooker+jg >>IainIr+oK1
4. jeffbe+Y6[view] [source] 2024-01-16 14:06:15
>>fooker+(OP)
FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4

On an Intel Sapphire Rapids CPU, this microbenchmark takes ~980ns when compiled with -march=k8 (to get the older form) and ~570ns when compiled with -march=native. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.

replies(1): >>fooker+6h
◧◩
5. fooker+jg[view] [source] [discussion] 2024-01-16 14:59:59
>>pclmul+X5
The branches in this example are not 50/50.

Given a few million calls of clamp, most would be no-ops in practice. Modern CPUs are very good at dynamically observing this.
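One rough way to check this claim against your own data is to count how many inputs are already in range. A minimal sketch, where count_noop_clamps is a hypothetical helper and int values are an assumption made only to keep it short:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many inputs already lie in [lo, hi],
// i.e. how many clamp calls would return their argument unchanged.
inline std::size_t count_noop_clamps(const std::vector<int>& values, int lo, int hi) {
    std::size_t noops = 0;
    for (int v : values) {
        if (!(v < lo) && !(hi < v)) {
            ++noops;
        }
    }
    return noops;
}
```

If the count is close to values.size(), the clamp branches are almost never taken and should be easy to predict. If a large share of the inputs fall outside the range in no particular order, that is the hard-to-predict case.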

replies(1): >>pclmul+XR2
◧◩
6. fooker+Fg[view] [source] [discussion] 2024-01-16 15:01:52
>>svanta+T3
Yes. On platform - most modern CPUs are happier with predictable branches than exotic instructions.

On surrounding code - for sure.

◧◩
7. fooker+6h[view] [source] [discussion] 2024-01-16 15:03:25
>>jeffbe+Y6
With random test cases, branch prediction can't help.
◧◩
8. IainIr+oK1[view] [source] [discussion] 2024-01-16 21:53:20
>>pclmul+X5
It's hard to predict statically which branches will be dynamically unpredictable.

A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.

I've never gone back to see whether the timeline checks out, but it's a good story.

replies(1): >>fooker+4L1
◧◩◪
9. fooker+4L1[view] [source] [discussion] 2024-01-16 21:57:12
>>IainIr+oK1
The compiler doesn't do much of the predicting; it's done by the CPU at runtime.
replies(1): >>kybore+1V1
◧◩◪◨
10. kybore+1V1[view] [source] [discussion] 2024-01-16 22:55:57
>>fooker+4L1
Not prediction, predication: https://en.wikipedia.org/wiki/Predication_(computer_architec...

By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
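As a loose software analogy (real predication happens per instruction in hardware, so this only illustrates the idea), both candidate values are computed up front and a mask decides which one survives. The name select_masked is made up for this sketch:

```cpp
#include <cstdint>

// Hypothetical name. Picks a when cond is true and b otherwise, without an
// if/else in the source. Whether the generated code is branch-free is still
// up to the compiler.
inline std::int32_t select_masked(bool cond, std::int32_t a, std::int32_t b) {
    // mask is all ones when cond is true and all zeros when it is false
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    // Keep the bits of a where mask is set, the bits of b elsewhere.
    // The conversion back to a signed type is modular on mainstream compilers
    // (and required to be so since C++20).
    return static_cast<std::int32_t>((ua & mask) | (ub & ~mask));
}
```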

Actually I think @IainIreland mis-remembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW .

TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.

◧◩◪
11. pclmul+XR2[view] [source] [discussion] 2024-01-17 06:26:04
>>fooker+jg
Do you know that for a fact? For all calls of clamp? I have definitely used min and max when they are true 50/50s and I assume clamp also gets some similar use.
replies(1): >>fooker+s43
◧◩◪◨
12. fooker+s43[view] [source] [discussion] 2024-01-17 08:07:27
>>pclmul+XR2
Modern compilers generate code assuming all branches are highly predictable.

If your use case does not follow that pattern and you really care about performance, you have to pull out something like inline assembly.

Consider software like ffmpeg, which has to do this for the sake of performance.
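To give an idea of what "something like inline assembly" can look like, here is a sketch that forces a conditional move using GCC/Clang extended asm on x86-64. It is not taken from ffmpeg; select_cmov is a hypothetical name, and other compilers or architectures fall back to a plain ternary.

```cpp
// Hypothetical name. Returns a when cond is true, b otherwise.
inline int select_cmov(bool cond, int a, int b) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    int result = b;  // start with the "false" value
    asm("test %1, %1\n\t"    // set ZF according to cond
        "cmovne %2, %0"      // if cond != 0, overwrite result with a
        : "+r"(result)
        : "r"(static_cast<int>(cond)), "r"(a)
        : "cc");
    return result;
#else
    // Portable fallback: the compiler decides between a branch and a select.
    return cond ? a : b;
#endif
}
```

The cost is that the optimizer can no longer see through the selection, so this is only worth trying after measuring.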
